1 Introduction

1.1 Background

The initial inspiration for this work was the Data for Democracy NYC Accessibility project (Project GitHub Repo), which focused on exploring the lack of ADA-accessible stations in the New York City subway system. For those that are unfamiliar, ADA-accessible subway stations are those that comply with the Americans with Disabilities Act of 1990 and, among other requirements, provide elevator service such that an individual can get from street level to the turnstiles and to the train platform without having to use a staircase.

New York City’s subway system is one of the busiest in the world, with a weekday average ridership of 5.6 million riders as of 2017, but it is also the least accessible transit system in the United States. While the ADA focuses on individuals with disabilities, an ADA-compliant system is friendlier to all, from parents with young children and strollers to the elderly. An excuse for the state of things might be the age of the system since the NYC subway is among one of the older systems in the country. However, other mass transit systems that opened around the same time, such as the ones in Boston and Chicago, have a far higher percentage of compliant stations.

The accessibility problem is a symptom of the larger set of issues that have plagued the city’s transit system for years now. The subways are slow and overcrowded, the trains manage to be on time only 65% of the time (although this has increased to 70% as of December), and the whole of the MTA, the body responsible for the subway along with other transit systems, has been financially mismanaged to the brink of ruin. As an example of said mismanagement, the MTA invested in its bus system and Access-a-Ride, a door-to-door service, in an effort to increase accessibility instead of converting subway stations. But, not only is a system of buses not comparable in terms of speed and service quality to the subway, this decision wound up costing the MTA more money than it would have to retrofit elevators in subway stations. As of today, plans are underway to fix the centuries-old signal systems, increase train speeds, and to install elevators at existing stations to make them accessible. But given the previous track record, there is cause for concern that the NYC subway system can turn it around, or maybe Elon Musk will save the day in the end.

1.2 Project Goals

1.2.1 Goals

The Office of NYC Comptroller released a report in July of 2018 on the state of accessibility in the NYC subway system. It included a geospatial analysis of NYC neighborhoods and subway station locations, as well as the potential financial impact of subway inaccessibility on the individuals living in those neighborhoods.

I will aim to recreate the map below that was featured in the report, which differentiates neighborhoods based on whether their boundaries contain at least one ADA-accessible station, at least one subway station but no ADA-accessible stations, or no subway station at all.

In addition, I will go further and resolve what I see as some of the problems with the report.

1.2.2 Problems with report and solutions:

Problem 1: NYC neighborhoods are areas that people can easily identify with, but they are very large in terms of geographic area.

Proposed solution: Use the 2010 Census tracts instead, which are smaller in area, allowing for a more fine-grained analysis.

Problem 2: The report counts stations that are only within the boundaries of a neighboorhood. This is limiting and may be an inaccurate representation of accessibility if the station is located on the edge of a large neighborhood.

Proposed solution: Consider stations only within a certain radius of a geographical point, such as the center of the census tracts.

Problem 3: Report focuses on the presence/absence of a subway station, but no info on how many stations are in the vicinity.

Proposed solution: Count unique route station stops, including the total number of stations and ADA-accessible stations, within a given geographical area.

1.3 Data science skills utilized

I had been curious about geospatial analysis for a while, and this project was a great excuse to learn more. However, this project also turned out to require quite a lot of data cleaning, especially for the Subway Entrances and Exits dataset.

In the end, the main skills utilized and demonstrated throughout this work are:

  • Data cleaning - I had severely underestimated how messy and out-of-date the NYC Subway Entrances and Exits dataset was. It needed quite a lot of manual curation to reach an analysis-ready state.
  • Geospatial analysis in R - mapping, spatial joins, converting non-spatial data into a spatial format, and more, mainly with the help of the sf package. As an aside, this DataCamp course on the sf package, led by Zev Ross, was immensely helpful as a foundation.

2 Setup

2.1 Packages

Packages used:

  • tidyverse - omnibus package for data import, wrangling, and cleaning
  • sf - geospatial analysis
  • ggthemes - ggplot2 theme and palette add-on
  • mapview - easy to use package for creating interactive maps
library(tidyverse)
library(sf)
library(ggthemes)
library(mapview)
# minimal theme for nice plots throughout the project
theme_set(theme_minimal())

2.2 Data

Shapefiles, imported using the st_read function from the sf package:

  • nyc.census.map: Shapefile of NYC 2010 census tract boundaries
  • nyc.neigh.map: Shapefile of NYC neighborhoods

Datasets, in .csv format:

  • subway.ent.exit: Subway Entrances and Exits dataset, which provides information about what train routes stop at each station, whether the stations are ADA-accessible or not, as well as the station latitude and longitude coordinates, but in a non-spatial format with no coordinate reference system.
  • subway.by.line: In NYC, the subway routes are grouped into “trunk lines”, typically based on their route through Manhattan. The grouping indicates which trains share many of the same statio stops. Each trunk line also has its own distinct assigned color that appears on the subway route symbols. This dataset provides information on the grouping of routes by trunk line and the color codes.
  • num.stat.by.rt.wiki: The number of station stops by route according to their Wikipedia pages. This will be used later to clean the subway entrances and exits dataset. As an aside, NYC subway service changes over the course of the day and the weekend, with some lines switching from express to local or going out of service altogether. In this analysis, I will be focusing on the weekday rush-hour service because, in theory, that is when the most people should be using the subway and it is most needed.
### shapefiles ###
nyc.census.map <- st_read("./data/nyct2010_18a/nyct2010.shp")
Reading layer `nyct2010' from data source `C:\Users\Dasha\Documents\GitHub\subway_accessibility_map\data\nyct2010_18a\nyct2010.shp' using driver `ESRI Shapefile'
Simple feature collection with 2166 features and 11 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
epsg (SRID):    NA
proj4string:    +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
nyc.neigh.map <- st_read("./data/nynta_18d/nynta.shp")
Reading layer `nynta' from data source `C:\Users\Dasha\Documents\GitHub\subway_accessibility_map\data\nynta_18d\nynta.shp' using driver `ESRI Shapefile'
Simple feature collection with 195 features and 7 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
epsg (SRID):    NA
proj4string:    +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
### subway info ###
subway.ent.exit <- read_csv("./data/2018_update/NYC_Transit_Subway_Entrance_And_Exit_Data.csv")
subway.by.line <- read_csv("./data/2018_update/nyc_subway_stations_grouped.csv")
num.stat.by.rt.wiki <- read_csv("./data/2018_update/nyc_subway_num_stat_by_line.csv")

Data sources:

2.3 Data Preview

The NYC census tract and neighborhood sf files can be visualized in a number of ways, including with the ggplot2 package and the plot function. Here, ggplot is used to layer the census tracts in blue and the neighborhood boundaries in orange to demonstrate the difference, for those not familiar with the city.

# nyc census tracts map in blue
nyc.census.map %>% 
  ggplot() +
  geom_sf(color = "#1F77B4") +
  # nyc neighborhoods outlines overlaid in orange
  geom_sf(data = nyc.neigh.map, color = "#FF7F0E", size = 1, fill = NA) +
  ggtitle("NYC Map") +
  xlab("Longitude") +
  ylab("Latitude")

Most neighborhoods include multiple census tracts, with the exception of parks and airports, and census tracts are contained within neighborhoods. Both shapefiles include Staten Island, for which there is no data in the subway entrances/exits dataset and therefore it will be removed from further consideration.

A very useful feature of the sf format is that these objects are data frames and can be treated as such for the purposes of filtering, joining, and other manipulations, as well as being spatial objects.

summary(nyc.census.map)
    CTLabel     BoroCode          BoroName       CT2010       BoroCT2010  
 138    :   5   1:288    Bronx        :339   003300 :   5   1000100:   1  
 151    :   5   2:339    Brooklyn     :760   003900 :   5   1000201:   1  
 247    :   5   3:760    Manhattan    :288   007500 :   5   1000202:   1  
 251    :   5   4:668    Queens       :668   007700 :   5   1000500:   1  
 279    :   5   5:111    Staten Island:111   013800 :   5   1000600:   1  
 33     :   5                                015100 :   5   1000700:   1  
 (Other):2136                                (Other):2136   (Other):2160  
 CDEligibil    NTACode                    NTAName          PUMA     
 E:1002     BK50   :  34   Canarsie           :  34   4009   :  83  
 I:1164     BK31   :  28   Bay Ridge          :  28   4112   :  80  
            BK88   :  28   Borough Park       :  28   4105   :  68  
            BK58   :  27   Crown Heights North:  27   4103   :  65  
            BK61   :  27   East New York      :  27   4110   :  65  
            BK82   :  27   Flatlands          :  27   4101   :  59  
            (Other):1995   (Other)            :1995   (Other):1746  
   Shape_Leng         Shape_Area                 geometry   
 Min.   :   168.5   Min.   :      582   MULTIPOLYGON :2166  
 1st Qu.:  5622.6   1st Qu.:  1683579   epsg:NA      :   0  
 Median :  6496.8   Median :  1987942   +proj=lcc ...:   0  
 Mean   :  8726.0   Mean   :  3891726                       
 3rd Qu.:  8734.1   3rd Qu.:  3189156                       
 Max.   :186126.0   Max.   :196238540                       
                                                            

For example, as demonstrated above, the nyc.census.map object includes the tract codes, the city borough name that the tract belongs to, as well as the spatial geometry.

As for the other datasets, the Subway Entrances and Exits dataset contains the most useful and relevant information for this project. However, it will require considerable transformation to get it into a usable format.

# subway entrances/exit data:
glimpse(subway.ent.exit)
Observations: 1,868
Variables: 30
$ Division             <chr> "BMT", "BMT", "BMT", "BMT", "BMT", "BMT",...
$ Line                 <chr> "4 Avenue", "4 Avenue", "4 Avenue", "4 Av...
$ `Station Name`       <chr> "25th St", "25th St", "36th St", "36th St...
$ `Station Latitude`   <dbl> 40.66040, 40.66040, 40.65514, 40.65514, 4...
$ `Station Longitude`  <dbl> -73.99809, -73.99809, -74.00355, -74.0035...
$ Route1               <chr> "R", "R", "N", "N", "N", "R", "R", "R", "...
$ Route2               <chr> NA, NA, "R", "R", "R", NA, NA, NA, NA, NA...
$ Route3               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route4               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route5               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route6               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route7               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route8               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route9               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route10              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route11              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ `Entrance Type`      <chr> "Stair", "Stair", "Stair", "Stair", "Stai...
$ Entry                <chr> "YES", "YES", "YES", "YES", "YES", "YES",...
$ `Exit Only`          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Vending              <chr> "YES", "YES", "YES", "YES", "YES", "YES",...
$ Staffing             <chr> "FULL", "NONE", "FULL", "FULL", "FULL", "...
$ `Staff Hours`        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ ADA                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
$ `ADA Notes`          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ `Free Crossover`     <lgl> FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRU...
$ `North South Street` <chr> "4th Ave", "4th Ave", "4th Ave", "4th Ave...
$ `East West Street`   <chr> "25th St", "25th St", "36th St", "36th St...
$ Corner               <chr> "SE", "SW", "NW", "NE", "NW", "NE", "NW",...
$ `Entrance Latitude`  <dbl> 40.66032, 40.66049, 40.65449, 40.65436, 4...
$ `Entrance Longitude` <dbl> -73.99795, -73.99822, -74.00450, -74.0041...
# subway summary data
glimpse(subway.by.line)
Observations: 10
Variables: 4
$ `Primary Trunk line` <chr> "IND Eighth Avenue Line", "IND Sixth Aven...
$ Color                <chr> "Vivid blue", "Bright orange", "Lime gree...
$ Hexadecimal          <chr> "#2850ad", "#ff6319", "#6cbe45", "#a7a9ac...
$ Lines                <chr> "A_C_E", "B_D_F_M", "G", "L", "J_Z", "N_Q...
# num of station stops by route:
glimpse(num.stat.by.rt.wiki)
Observations: 25
Variables: 4
$ route_name        <chr> "A", "C", "E", "B", "D", "F", "M", "G", "L",...
$ num_stations_norm <dbl> 44, 40, 22, 27, 36, 45, 36, 21, 24, 30, 21, ...
$ late_night        <dbl> 66, NA, 32, NA, 41, NA, 8, NA, NA, NA, NA, 4...
$ limited           <dbl> NA, NA, 19, 37, NA, NA, 13, NA, NA, 20, NA, ...

Preliminary to-do list:

  • Remove Staten Island from the maps, the subway datasets do not include Staten Island transit routes
  • subway.ent.exit: The entrances and exits dataset is very messy. It needs cleaning and the route columns need to be reorgranized into a tidy format. Will also need to evaluate how useful some of the other columns about the specific entrances and exits data might be.
  • subway.by.line: Needs to be tidied, with each subway route as its own row.
  • num.stat.by.rt.wiki: Is fine as is.

3 Preliminary exploration and cleaning

3.1 Basic cleaning

For the maps, Staten Island will be filtered out, the borough name will be converted to lowercase and the column names will be converted to lowercase for later convenience.

nyc.census.4boro <- nyc.census.map %>%  
  filter(BoroName != "Staten Island") %>% 
  mutate(BoroName = str_to_lower(BoroName)) %>% 
  `colnames<-`(str_to_lower(colnames(nyc.census.map)))
nyc.neigh.4boro <- nyc.neigh.map %>% 
  filter(BoroName != "Staten Island") %>% 
  mutate(BoroName = str_to_lower(BoroName))%>% 
  `colnames<-`(str_to_lower(colnames(nyc.neigh.map)))
# new column names:
colnames(nyc.census.4boro)
 [1] "ctlabel"    "borocode"   "boroname"   "ct2010"     "boroct2010"
 [6] "cdeligibil" "ntacode"    "ntaname"    "puma"       "shape_leng"
[11] "shape_area" "geometry"  
colnames(nyc.neigh.4boro)
[1] "borocode"   "boroname"   "countyfips" "ntacode"    "ntaname"   
[6] "shape_leng" "shape_area" "geometry"  
ggplot(nyc.neigh.4boro) +
  geom_sf() +
  ggtitle("NYC Neighborhoods Map\nNo Staten Island") +
  xlab("Longitude") +
  ylab("Latitude")

For the other dataframes, first cleanup the column names of the two subway info dataframes by converting the words to lower-case and by replacing empty spaces to make them easier to work with:

colnames(subway.ent.exit) <- colnames(subway.ent.exit) %>% 
  str_to_lower() %>% 
  str_replace_all(" ", "_")
colnames(subway.ent.exit)
 [1] "division"           "line"               "station_name"      
 [4] "station_latitude"   "station_longitude"  "route1"            
 [7] "route2"             "route3"             "route4"            
[10] "route5"             "route6"             "route7"            
[13] "route8"             "route9"             "route10"           
[16] "route11"            "entrance_type"      "entry"             
[19] "exit_only"          "vending"            "staffing"          
[22] "staff_hours"        "ada"                "ada_notes"         
[25] "free_crossover"     "north_south_street" "east_west_street"  
[28] "corner"             "entrance_latitude"  "entrance_longitude"
colnames(subway.by.line) <- colnames(subway.by.line) %>% 
  str_to_lower() %>% 
  str_replace_all(" ", "_")
colnames(subway.by.line)
[1] "primary_trunk_line" "color"              "hexadecimal"       
[4] "lines"             

Much better!

Next, the lines column in the subway.by.line dataset needs to be separated so that each subway line is its own row.

sub.line.tidy <- subway.by.line %>% 
  # separate the route names into their own columns (wide format)
  separate(lines, into = c(paste("route", 1:4, sep = "_")), remove = FALSE) %>% 
  # gather the separated route names into one column (long format)
  gather("route_num", "route_name", route_1:route_4) %>% 
  # remove extra rows created from extra columns where there fewer than 4 routes for that line
  filter(!is.na(route_name)) %>% 
  # unnecessary column: 
  select(-route_num) %>% 
  arrange(route_name)
head(sub.line.tidy)
# A tibble: 6 x 5
  primary_trunk_line               color       hexadecimal lines route_name
  <chr>                            <chr>       <chr>       <chr> <chr>     
1 IRT Broadway-Seventh Avenue Line Tomato red  #ee352e     1_2_3 1         
2 IRT Broadway-Seventh Avenue Line Tomato red  #ee352e     1_2_3 2         
3 IRT Broadway-Seventh Avenue Line Tomato red  #ee352e     1_2_3 3         
4 IRT Lexington Avenue Line        Apple green #00933c     4_5_6 4         
5 IRT Lexington Avenue Line        Apple green #00933c     4_5_6 5         
6 IRT Lexington Avenue Line        Apple green #00933c     4_5_6 6         

Now that that dataset is clean, it is possible to check the route names against the entrances/exits data in order to determine if there is anything odd or any typos:

subway.ent.exit %>% 
  # gather the unique route names across all of the route columns in the subway ent/exit dataset:
  select(route1:route11) %>% 
  gather("route_num", "route_name") %>% 
  filter(!(is.na(route_name))) %>% 
  select(-route_num) %>% 
  distinct() %>% 
  # any route names in the ent/exit df that are not in the official route list on the wiki?
  anti_join(sub.line.tidy, by = "route_name")
# A tibble: 4 x 1
  route_name
  <chr>     
1 GS        
2 FS        
3 e         
4 H         

Yes, it looks like there are four routes in the subway entrance/exit data that are not in the route list. However, the e is simply a typo that should be changed to E, which is a real route, and the three routes GS, FS, and H all fall under the umbrella of S in the wiki route list. They are their own separate, relatively short routes, that all fall under the designation of shuttles. GS is the 42nd St shuttle in Manhattan that only stops in Times Square and Grand Central. FS is the Franklin Avenue shuttle that operates between Franklin Ave and Prospect Park in Brooklin. Lastly, H is the Rockaways shuttle.

On to the subway entrances/exits dataset itself. First and foremost, according to wikipedia, there should be what is considered by the city 472 subway stations or 424 if the connected stations are considered single stations. I expect that the raw dataset will be off in some regard, since it was not even updated with the Second Avenue subway stations and the changes to Q route service, which also added 3 stations in Manhattan.

# number of unique subway station names:
length(unique(subway.ent.exit$station_name))
[1] 356

That is far too few stations, but that is not a surprise given that station names are often reused, for example there are 5 “23rd Street” stations in Manhattan and 1 in Queens. Therefore, the station name column alone cannot be used as a unique key for the stations in this dataset.

What about other alternatives? The subway by line dataset had special codes and names to refer to the trunk line, such as IRT and BMT, which matches the division and line columns in the entrances/exits data. I suspect that the division, line, and station_name columns will give the unique identifier for each station. But how many unique combinations of those 3 columns are there?

subway.ent.exit %>% 
  select(division:station_name) %>% 
  distinct() %>% 
  head()
# A tibble: 6 x 3
  division line     station_name
  <chr>    <chr>    <chr>       
1 BMT      4 Avenue 25th St     
2 BMT      4 Avenue 36th St     
3 BMT      4 Avenue 45th St     
4 BMT      4 Avenue 53rd St     
5 BMT      4 Avenue 59th St     
6 BMT      4 Avenue 77th St     
subway.ent.exit %>% 
  select(division:station_name) %>% 
  distinct() %>% 
  nrow()
[1] 465

There are 465 such combinations, which is close to the 472 number. Adding in 3 missing new Q-train stations brings it up to 468, but there may be more stations missing than I thought.

As an aside, how many station name, latitude, and longitude combinations are there?

subway.ent.exit %>% 
  select(station_name, station_latitude:station_longitude) %>% 
  distinct() %>% 
  nrow()
[1] 473

There are more of those combinations than there are stations. It turns out that some stations had multiple coordinates, which is likely related to the entrances and exits, or maybe due to connected stations.

Now for some cleanup of the subway entrance/exit dataset:

  • Create a (theoretically) unique station name column by combining the division. line, and station_name
  • Fix the capitalization typo
sub.ent.w.key <- subway.ent.exit %>% 
  mutate_at(vars(division:station_name), str_to_lower) %>% 
  # create a unique key for each station
  unite("stat_name", division:station_name, sep = "_") %>% 
  # capitalize all of the route names (to fix the e issue)
  mutate_at(vars(route1:route11), str_to_upper) 
sub.ent.w.key <- sub.ent.w.key %>% 
  select(-c(station_latitude:station_longitude)) %>% 
  distinct() %>% 
  left_join(
    sub.ent.w.key %>% 
      select(stat_name:station_longitude) %>% 
      distinct() %>% 
      group_by(stat_name) %>% 
      summarize(
        avg_stat_lat = mean(station_latitude),
        avg_stat_long = mean(station_longitude)
        ),
    by = "stat_name"
    )
length(unique(sub.ent.w.key$stat_name))
[1] 465
sub.ent.w.key %>% 
  select(stat_name, avg_stat_lat:avg_stat_long) %>% 
  distinct() %>% 
  nrow()
[1] 465

Now the number of unique station names and the combination of station names and geographical locations are equal.

3.2 Entrance Analysis

Apart from the subway station location information, the entrances and exits dataset provides information on, well, the entrances and exits. What additional information does this provide? Is it useful in any way and does it give some extra insights? How many of each entrance type are there and what is the relationship between the entrance types and the ADA-accessibility rating of the station?

# take unique station name and entrance information, as well as the ada rating
sub.entrances <- sub.ent.w.key %>% 
  select(stat_name, entrance_latitude:entrance_longitude, entrance_type:entry, ada) %>%
  distinct()
# result:
head(sub.entrances)
# A tibble: 6 x 6
  stat_name     entrance_latitu~ entrance_longit~ entrance_type entry ada  
  <chr>                    <dbl>            <dbl> <chr>         <chr> <lgl>
1 bmt_4 avenue~             40.7            -74.0 Stair         YES   FALSE
2 bmt_4 avenue~             40.7            -74.0 Stair         YES   FALSE
3 bmt_4 avenue~             40.7            -74.0 Stair         YES   FALSE
4 bmt_4 avenue~             40.7            -74.0 Stair         YES   FALSE
5 bmt_4 avenue~             40.7            -74.0 Stair         YES   FALSE
6 bmt_4 avenue~             40.6            -74.0 Stair         YES   FALSE
table(sub.entrances$entrance_type)

     Door  Easement  Elevator Escalator      Ramp     Stair   Walkway 
       80        91        48        27         3      1613         1 
table(sub.entrances$ada)

FALSE  TRUE 
 1399   464 

Stairs are by far the most common entrance/exit type, which is something new learned, but the ADA count by itself doesn’t add much. I’m curious what the relationship between the ADA rating and the entrance type is per station:

sub.entrances %>% 
  group_by(stat_name) %>% 
  summarize(
    # count total number of entrances:
    num_entry = n(),
    # ada is TRUE/FALSE - can sum to get number of ada = TRUE per station:
    num_ada = sum(ada),
    # % ada out of total num of entrances, per station:
    percent_ada = num_ada * 100 / num_entry
  ) %>% 
  {table(.$percent_ada)}

  0 100 
381  84 

The result is either the stations are either 0% ADA or 100% ADA, which indicates that the ADA TRUE/FALSE rating is given to the entire station, not the particular entrance/exit. This suddenly makes the entrance columns a lot less interesting to me, and I will remove them from consideration after a few more plots.

!: WHat are the most comment entrance types for ADA-accessible and not accessible stations?

sub.entrances %>% 
  group_by(entrance_type, ada) %>% 
  count() %>% 
  ggplot(aes(entrance_type, n, fill = ada)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  xlab("Entrance Type") +
  ylab("Count") +
  scale_fill_tableau()

Stairs by far are the most common, and in general stations that have stair entrances are more often ADA = FALSE than not, but the size of the Stairs bars dominates the plot, so what about a different view?

sub.entrances %>% 
  group_by(entrance_type, ada) %>% 
  count() %>% 
  ggplot(aes(entrance_type, n, fill = ada)) +
  geom_bar(stat = "identity", position = "fill") +
  theme_minimal() +
  xlab("Entrance Type") +
  ylab("Proportion") +
  scale_fill_tableau()

This alternative view shows that Elevator entrances tend to denote ADA-accessible stations, more often than not. I tried to search, but I’m not sure what an “easement” entrance may be, it sounds like a private entrance, perhaps for utility work. A typical accessible routes are elevators and escalators, although I believe an elevator is required for a station to be fully ADA-accessible. (source?)

3.3 Removing entrance info and tidying more

Now to remove the entrance information, since it does not seem that useful for my purpose, and to convert the route columns into a tidy format. In order to replicate the geospatial analysis done in the official report, the route information is not necessary and could be dropped at this point. However, my interest lies in how many stations are located within certain spatial areas. I intend to treat each train route stop individually and to count each train that stops at a station individual. So if 3 trains, say the 4, 5, and 6 trains, all stop at one station, I would like to be able to count each of those because the more trains stop in a given area, the more “accessible” or reachable by subway that area is.

sub.ent.sml <- sub.ent.w.key %>% 
  # select relevant columns, discard all others from this point on
  select(stat_name, avg_stat_lat:avg_stat_long, route1:route11, ada, ada_notes) %>% 
  distinct() %>% 
  # reformat the route columns into a long format
  gather("route_num", "route_name", route1:route11) %>% 
  # get rid of the many NAs in the route column that were there due to the formatting
  filter(!is.na(route_name)) %>% 
  select(-route_num) %>% 
  distinct()
# new format
dim(sub.ent.sml)
[1] 985   6
head(sub.ent.sml)
# A tibble: 6 x 6
  stat_name           avg_stat_lat avg_stat_long ada   ada_notes route_name
  <chr>                      <dbl>         <dbl> <lgl> <chr>     <chr>     
1 bmt_4 avenue_25th ~         40.7         -74.0 FALSE <NA>      R         
2 bmt_4 avenue_36th ~         40.7         -74.0 FALSE <NA>      N         
3 bmt_4 avenue_45th ~         40.6         -74.0 FALSE <NA>      R         
4 bmt_4 avenue_53rd ~         40.6         -74.0 FALSE <NA>      R         
5 bmt_4 avenue_59th ~         40.6         -74.0 FALSE <NA>      N         
6 bmt_4 avenue_77th ~         40.6         -74.0 FALSE <NA>      R         
# number of unique stations and train route combinations:
sub.ent.sml %>% 
  select(stat_name, route_name) %>% 
  distinct() %>% 
  nrow()
[1] 980
# missing values check:
sapply(sub.ent.sml, anyNA)
    stat_name  avg_stat_lat avg_stat_long           ada     ada_notes 
        FALSE         FALSE         FALSE         FALSE          TRUE 
   route_name 
        FALSE 

At least formatting wise, the subway station and route dataset is much neater, with each route now on its own row instead of in columns, along with the ADA rating and station location. I’ve also taken along the ada_notes column which is not NA for a small number of stations and should be easier to explore now that the dataset of interest is much smaller.

3.4 Trouble ahead

sub.ent.sml %>% 
  select(stat_name, ada, ada_notes) %>% 
  distinct() %>% 
  filter(!(is.na(ada_notes)))
# A tibble: 13 x 3
   stat_name                                  ada   ada_notes        
   <chr>                                      <lgl> <chr>            
 1 ind_8 avenue_50th st                       TRUE  Southbound Only  
 2 ind_8 avenue_world trade center            TRUE  Construction     
 3 ind_archer av_sutphin blvd-archer av - jfk TRUE  Check            
 4 bmt_broadway_49th st                       TRUE  Northbound Only  
 5 bmt_broadway_times square-42nd st          TRUE  Shuttle not ADA  
 6 bmt_broadway_union square                  TRUE  Lex not ADA      
 7 bmt_canarsie_union square                  TRUE  Lex not ADA      
 8 ind_concourse_kingsbridge rd               FALSE in planning      
 9 irt_lexington_23rd st                      FALSE In Planning      
10 irt_lexington_brooklyn bridge-city hall    TRUE  J Z not ADA      
11 irt_lexington_canal st                     TRUE  Bway Nass not ADA
12 irt_pelham_hunts point av                  FALSE in planning      
13 ind_queens boulevard_forest hills-71st av  FALSE in planning      

These notes were the first real sign of trouble ahead.

Here are some of the next action plans based on these notes: * Some ADA changes were under construction or in planning at the time - need to check if construction at these stations has been completed and the ADA = FALSE can be changed to TRUE * Some of the stations are ADA-accessible in only one direction (northbound or southbound only). In the case of this analysis, I decided to allow ADA = TRUE for the entire station if it was ADA-accessible in at least one direction, since for my purposes it did not make sense to count a station as “half-accessible”. * Some stations, even though from the previous steps it was determined that the stations were either all ADA-accessible or not, are not fully accessible to all the trains that stop at that station. For example, the Union Square station is rated as ADA-accessible for the L, N, Q, R, and W trains, but not for the Lexington Ave 4, 5, and 6 trains. This is a consequence of the original formatting of the data. Because for me it is important to get the count right for each individual train, I would need to correct this and change ADA = FALSE where appropriate. * Some stations seem to be listed twice because different trunk lines stop there. For example, again, Union Square, which is considered one station, but is listed as bmt_broadway_union square for the N/Q/R/Q trunk line and bmt_canarsie_union square for the L trunk line.

This was my first real sign of just how messy this dataset is in its current state. Most worrying for my purposes was that it turned out that for each connected station, all the train route names are repeated. For example, there is a World Trade Center stop where only the E stops. But some subway stations are connected underground by tunnels and one can transfer from one station to another.

If I filter for this station only, where only the E should stop, I instead get 5 trains at this station as a result:

sub.ent.sml %>% 
  filter(stat_name == "ind_8 avenue_world trade center")
# A tibble: 5 x 6
  stat_name          avg_stat_lat avg_stat_long ada   ada_notes  route_name
  <chr>                     <dbl>         <dbl> <lgl> <chr>      <chr>     
1 ind_8 avenue_worl~         40.7         -74.0 TRUE  Construct~ A         
2 ind_8 avenue_worl~         40.7         -74.0 TRUE  Construct~ C         
3 ind_8 avenue_worl~         40.7         -74.0 TRUE  Construct~ E         
4 ind_8 avenue_worl~         40.7         -74.0 TRUE  Construct~ 2         
5 ind_8 avenue_worl~         40.7         -74.0 TRUE  Construct~ 3         

This is because the World Trace Center stop is connected to the Park Place stop, where the 2 and 3 stop, and Chambers St, where the A and C stop.

Now, if I select for the Park Place station information:

sub.ent.sml %>% 
  filter(stat_name == "irt_clark_park place")
# A tibble: 6 x 6
  stat_name           avg_stat_lat avg_stat_long ada   ada_notes route_name
  <chr>                      <dbl>         <dbl> <lgl> <chr>     <chr>     
1 irt_clark_park pla~         40.7         -74.0 FALSE <NA>      A         
2 irt_clark_park pla~         40.7         -74.0 FALSE <NA>      C         
3 irt_clark_park pla~         40.7         -74.0 FALSE <NA>      E         
4 irt_clark_park pla~         40.7         -74.0 FALSE <NA>      1         
5 irt_clark_park pla~         40.7         -74.0 FALSE <NA>      2         
6 irt_clark_park pla~         40.7         -74.0 FALSE <NA>      3         

Thankfully, the ADA rating is separate from the World Trade Center stop and is correctly marked as FALSE, but the issue is that all of the trains are also listed as stopping at this station, even though only the 2 and 3 trains should stop here, on top of the 1 train being listed here for some reason, even though the 1 route does not follow the 2 and 3 in this part of Manhattan.

More worrying, is that based on the data I got from wikipedia, the associated “line” names with the IND division, for example, should only be:

sub.line.tidy %>% 
  filter(str_detect(primary_trunk_line, "IND")) %>% 
  select(primary_trunk_line) %>% 
  distinct()
# A tibble: 3 x 1
  primary_trunk_line    
  <chr>                 
1 IND Eighth Avenue Line
2 IND Sixth Avenue Line 
3 IND Crosstown Line    

However, the line names for the subway entrances/exit data are far more varied:

sub.ent.sml %>%
  select(stat_name) %>% 
  distinct() %>% 
  # select only the IND division subway station stops
  filter(str_detect(stat_name, "ind")) %>% 
  # separate out the different name parts again
  separate(stat_name, into = c("division", "line", "station_name"), sep = "_") %>% 
  {table(.$line)}

        6 avenue      63rd street         8 avenue        archer av 
              20                3               31                3 
       concourse        crosstown           culver           fulton 
              11               12               10               15 
         liberty queens boulevard         rockaway 
               7               24               14 

Yes, the 8th Ave, 6th Ave, and Crosstown lines are there (although in a different format), but so are a number of other line names. This suggests that I cannot reliably use the line information, and maybe even the division codes to filter the subway station and train route combinations.

It also seems that the dataset was far more out of date than I expected. For example, the G Train Route is missing station stops between Smith 9th St and Church Ave, where the service was extended to some time ago. Luckily, in this case at least, those stations could just be added in by adding in a part of the F train route.

# transform the neighborhood map crs to march the subway entrance/exit data:
nyc.map.4boro.stat.crs <- st_transform(nyc.neigh.4boro, crs = "+init=epsg:4326")
nyc.map.4boro.stat.crs %>% 
  ggplot() +
  geom_sf(fill = NA) +
  geom_point(
    data = sub.ent.sml %>% 
      filter(route_name == "G"),
    aes(avg_stat_long, avg_stat_lat, color = route_name),
    size = 2, alpha = 0.8
    ) +
  xlab("Longitude") +
  ylab("Latitude") +
  theme(legend.position = "none") +
  ggtitle("G Train Route (Outdated)")

# should add those in - follows the F train here, get the slice of stations: 
sub.ent.sml %>% 
  filter(route_name == "F") %>% 
  arrange(avg_stat_lat) %>% 
  filter(avg_stat_lat > 40.62976 & avg_stat_lat < 40.68030)
# A tibble: 8 x 6
  stat_name           avg_stat_lat avg_stat_long ada   ada_notes route_name
  <chr>                      <dbl>         <dbl> <lgl> <chr>     <chr>     
1 ind_culver_ditmas ~         40.6         -74.0 FALSE <NA>      F         
2 ind_6 avenue_churc~         40.6         -74.0 TRUE  <NA>      F         
3 ind_6 avenue_fort ~         40.7         -74.0 FALSE <NA>      F         
4 ind_6 avenue_prosp~         40.7         -74.0 FALSE <NA>      F         
5 ind_6 avenue_7th av         40.7         -74.0 FALSE <NA>      F         
6 ind_6 avenue_4th av         40.7         -74.0 FALSE <NA>      F         
7 bmt_4 avenue_9th st         40.7         -74.0 FALSE <NA>      F         
8 ind_6 avenue_smith~         40.7         -74.0 FALSE <NA>      F         
# G train is considered IND - grab that 4th ave station
sub.ent.sml <- sub.ent.sml %>% 
  bind_rows(
    sub.ent.sml %>% 
      filter(route_name == "F") %>% 
      arrange(avg_stat_lat) %>% 
      filter(avg_stat_lat > 40.63612 & avg_stat_lat < 40.67358 & str_detect(stat_name, "ind")) %>% 
      mutate(route_name = "G")
)
# Updated G train route:
nyc.map.4boro.stat.crs %>% 
  ggplot() +
  geom_sf(fill = NA) +
  geom_point(
    data = sub.ent.sml %>% 
      filter(route_name == "F" | route_name == "G"),
    aes(avg_stat_long, avg_stat_lat, color = route_name),
    size = 2, alpha = 0.8
    )+
  xlab("Longitude") +
  ylab("Latitude") +
  scale_color_discrete(name = "Route") +
  ggtitle("G and F Train Routes (Updated)")

The updated route length is correct, but now there are too many stations than expected:

# new number of G train stations:
sub.ent.sml %>%
  filter(route_name == "G") %>% 
  nrow()
## [1] 25
num.stat.by.rt.wiki %>% 
  filter(route_name == "G")
## # A tibble: 1 x 4
##   route_name num_stations_norm late_night limited
##   <chr>                  <dbl>      <dbl>   <dbl>
## 1 G                         21         NA      NA

At this point, I’m starting to think that I’ll need to manually check through the station and train associations, but first I will go back to the ada_notes issue and fix that before moving on.

3.5 ADA fixes and hitting a wall

To fix:

  • ind_8 avenue_50th st / TRUE / Southbound Only - Leave as ada = TRUE (no changes)
  • ind_8 avenue_world trade center / TRUE / Construction - ada = TRUE for E only
  • ind_archer av_sutphin blvd-archer av - jfk / TRUE / Check - ada = TRUE according to wikipedia page (no change)
  • bmt_broadway_49th st / TRUE / Northbound Only - Leave as ada = TRUE (no change)
  • bmt_broadway_times square-42nd st / TRUE / Shuttle not ADA - set S to ada = FALSE
  • bmt_broadway_union square / TRUE / Lex not ADA - set 4/5/6 to ada = FALSE
  • bmt_canarsie_union square / TRUE / Lex not ADA - same as above
  • ind_concourse_kingsbridge rd / FALSE / in planning - switch to TRUE because it is complete
  • irt_lexington_23rd st / FALSE In Planning - switch to TRUE because it is complete
  • irt_lexington_brooklyn bridge-city hall / TRUE / J Z not ADA - set J/Z to ada = FALSE
  • irt_lexington_canal st / TRUE / Bway Nass not ADA - Only set 6 ada = TRUE
  • irt_pelham_hunts point av / FALSE / in planning - complete, set ada = TRUE
  • ind_queens boulevard_forest hills-71st av / FALSE / in planning - complete, set ada = TRUE

An additional small fixt that I noticeds is that there are two redundant routes: GS and S routes that refer to the same thing, the grand central shuttle between times square and grand central. Therefore I will filter out the “S” route and change the GS ADA column to FALSE, since it is not ADA-accessible at neither Times Square nor Grand Central.

I chose to keep each change as a separate mutate call simply to keep track of the changes, although it is inefficient.

sub.ent.ada.updt <- sub.ent.sml %>% 
  # change world trade center station ada
  mutate(ada = ifelse((stat_name == "ind_8 avenue_world trade center" & route_name != "E"), FALSE, ada)) %>% 
  # change times square shuttle ada, filter out extra "S" route
  filter(route_name != "S") %>% 
  mutate(ada = ifelse(route_name == "GS", FALSE, ada)) %>% 
  # change 4/5/6 at union square to ada = FALSE
  mutate(ada = ifelse(
      ((stat_name == "bmt_broadway_union square" | stat_name == "bmt_canarsie_union square") &
         (route_name == "4" | route_name == "5" | route_name == "6")), 
      FALSE, ada
      )) %>% 
  # change kingsbridge rd ada = TRUE
  mutate(ada = ifelse(stat_name == "ind_concourse_kingsbridge rd", TRUE, ada)) %>% 
  # change Lex / 23rd St stop to TRUE
  mutate(ada = ifelse(stat_name == "irt_lexington_23rd st", TRUE, ada)) %>% 
  # change J/Z at Brooklyn Bridge / City Hall to ada = FALSE
  mutate(ada = ifelse(
    (stat_name == "irt_lexington_brooklyn bridge-city hall" & (route_name == "J" | route_name == "Z")),
    FALSE, ada
    )) %>% 
  # change irt_lexington_canal st to ada = TRUE for 6 train only
  mutate(ada = ifelse(
    (stat_name == "irt_lexington_canal st" & route_name != "6"), FALSE, ada
    )) %>% 
  # change rt 6 hunts point av to ada = TRUE
  mutate(ada = ifelse(stat_name == "irt_pelham_hunts point av", TRUE, ada)) %>% 
  # convert the forst hills / 71st ave station to ada = TRUE for all lines
  mutate(ada = ifelse(stat_name == "ind_queens boulevard_forest hills-71st av", TRUE, ada)) %>% 
  # can get rid of the ada_notes column now
  select(-ada_notes) %>% 
  distinct()
dim(sub.ent.sml)
[1] 990   6
dim(sub.ent.ada.updt)
[1] 982   5
sub.ent.sml %>% 
  filter(route_name == "S")
# A tibble: 3 x 6
  stat_name         avg_stat_lat avg_stat_long ada   ada_notes   route_name
  <chr>                    <dbl>         <dbl> <lgl> <chr>       <chr>     
1 ind_8 avenue_42n~         40.8         -74.0 TRUE  <NA>        S         
2 bmt_broadway_tim~         40.8         -74.0 TRUE  Shuttle no~ S         
3 irt_broadway-7th~         40.8         -74.0 TRUE  <NA>        S         
sub.ent.ada.updt %>% 
  filter(route_name == "S")
# A tibble: 0 x 5
# ... with 5 variables: stat_name <chr>, avg_stat_lat <dbl>,
#   avg_stat_long <dbl>, ada <lgl>, route_name <chr>

Got rid of a few rows by eliminating the redundant S route, and by removing the ada_notes column.

Now back to the previous problem of trains being assigned to subway stations that they do not stop at and all the previous problems brought up earlier.

How does the number of stations in the subway entrances/exit dataset per route compare to the expected number (according to wikipedia)?

sub.ent.rt.count <- sub.ent.ada.updt %>% 
  group_by(route_name) %>% 
  count()  
stat.route.join <- sub.ent.rt.count %>% 
  full_join(num.stat.by.rt.wiki, by = "route_name") %>% 
  # W exits in the wiki df, but not in the sub ent/exit df
  filter(route_name != "W")
stat.route.join %>% 
  ggplot(aes(num_stations_norm, n)) +
  # equality line for refence:
  geom_abline(intercept = 0, slope = 1, size = 1, alpha = 0.8) +
  geom_point(size = 2, alpha = 0.8) +
  xlab("Station num by route (wiki)") +
  ylab("Station num by route (actual count)")

stat.route.join %>% 
  filter(num_stations_norm == n)
# A tibble: 1 x 5
# Groups:   route_name [1]
  route_name     n num_stations_norm late_night limited
  <chr>      <int>             <dbl>      <dbl>   <dbl>
1 H              5                 5         NA      NA

TO EDIT

Nearly all, with the exception of the H Shuttle route, had more subway stations assigned to them than expected. The extra stations could not be accounted for by different service patterns (for example local late-night service for some routes). From exploring the dataset, it seems that the source of the problems were:

  • Station duplicates, or the station name differed based on the route, but all of the routes that stopped at that station were still assigned.
  • The dataset being simply outdated - no changes made to account for service changes over the years.
  • Separate stations connected by underground tunnels being listed as one station, with all routes servicing both stations. For example, Times Square / 42nd St is linked by a tunnel to the Port Authority / 42nd St stops, but these are distinct stations, served by distinct subway routes. However, all of the subway routes that serve both stations are assigned to both stations.

At this point, the options, as I saw them, were:

Solution 1: Import a dataset form the NYC Open Data website with updated station coordinates (SOURCE) and try to merge the two datasets Problems: Subway station names repeat, would probably have to still do a lot of manual data cleaning and validation to make sure the stations merge as expected. Also, if the subway route information is out of date, likely the ADA-status of stations is also outdated as well (this assumption turns out to be true in the end).

Solution 2: Manually go through route-by-route to make sure that the information is accurate and station duplicates are removed Problems: Tedious manual work, and likely easy to mess up. Will have to manually add-in new 2nd Ave Q-train stations, along with some others probably.

4 Manual data cleaning

4.1 Reasoning and Resources

I chose Solution 2 because, from working with other NYC subway-related datasets, I’ve learned that subway station name formatting tends to be inconsistent between datasets and a number of stations have duplicate names. Going line by line turned out to be less tedious than I expected in the end, because routes in the same primary trunk line tended to have the same issues that needed fixing (removing double stations, for example). Also, most of the subway stations were accurately labeled according to the primary trunk line 3-letter code (irt, ind, and so on), which helped narrow down the list of stations. It’s not very pretty, but it got the job done.

These tools helped quite a bit:

NYC Subway Map Route list

Disclaimer: The subway routes change considerably between normal, rush-hour, late-night, and weekend service. For sanity, I based the route/station assignments on the wikipedia counts for most lines, except for the B, which seems to have been based on weekday rush-hour service patterns for most trains, and on the routes on the MTA website list. I did not include late-night service changes (even though some routes stop at more stations).

4.2 Going by lowest number of stations

Starting slow, with the shuttles:

num.stat.by.rt.wiki %>% 
  filter(route_name == "GS")
# A tibble: 1 x 4
  route_name num_stations_norm late_night limited
  <chr>                  <dbl>      <dbl>   <dbl>
1 GS                         2         NA      NA
sub.ent.ada.updt %>% 
  filter(route_name == "GS") %>% 
  arrange(stat_name)
# A tibble: 4 x 5
  stat_name                     avg_stat_lat avg_stat_long ada   route_name
  <chr>                                <dbl>         <dbl> <lgl> <chr>     
1 irt_42nd st shuttle_grand ce~         40.8         -74.0 FALSE GS        
2 irt_42nd st shuttle_times sq~         40.8         -74.0 FALSE GS        
3 irt_flushing_grand central-4~         40.8         -74.0 FALSE GS        
4 irt_lexington_grand central-~         40.8         -74.0 FALSE GS        
sub.stat.num.updt <- sub.ent.ada.updt %>% 
  filter(!(route_name == "GS" & (str_detect(stat_name, "flushing") | str_detect(stat_name, "lexington"))))
num.stat.by.rt.wiki %>% 
  filter(route_name == "FS")
# A tibble: 1 x 4
  route_name num_stations_norm late_night limited
  <chr>                  <dbl>      <dbl>   <dbl>
1 FS                         4         NA      NA
sub.ent.rt.count %>% 
  filter(route_name == "FS")
# A tibble: 1 x 2
# Groups:   route_name [1]
  route_name     n
  <chr>      <int>
1 FS             6
sub.stat.num.updt %>% 
  filter(route_name == "FS") %>% 
  arrange(avg_stat_lat)
# A tibble: 6 x 5
  stat_name                     avg_stat_lat avg_stat_long ada   route_name
  <chr>                                <dbl>         <dbl> <lgl> <chr>     
1 bmt_brighton_prospect park            40.7         -74.0 TRUE  FS        
2 bmt_franklin_botanic gardens          40.7         -74.0 FALSE FS        
3 irt_eastern parkway_franklin~         40.7         -74.0 FALSE FS        
4 bmt_franklin_park place               40.7         -74.0 TRUE  FS        
5 bmt_franklin_franklin av              40.7         -74.0 TRUE  FS        
6 ind_fulton_franklin av                40.7         -74.0 TRUE  FS        
# remove the ind and irt stations - those are the extras
sub.stat.num.updt <- sub.stat.num.updt %>% 
  filter(!(route_name == "FS" & (str_detect(stat_name, "irt") | str_detect(stat_name, "ind"))))
sub.stat.num.updt %>% 
  filter(route_name == "FS")
# A tibble: 4 x 5
  stat_name                    avg_stat_lat avg_stat_long ada   route_name
  <chr>                               <dbl>         <dbl> <lgl> <chr>     
1 bmt_franklin_botanic gardens         40.7         -74.0 FALSE FS        
2 bmt_franklin_park place              40.7         -74.0 TRUE  FS        
3 bmt_franklin_franklin av             40.7         -74.0 TRUE  FS        
4 bmt_brighton_prospect park           40.7         -74.0 TRUE  FS        
sub.stat.num.updt %>% 
  group_by(route_name) %>% 
  count() %>% 
  ungroup() %>% 
  inner_join(num.stat.by.rt.wiki, by = "route_name") %>% 
  filter(n != num_stations_norm) %>% 
  filter(num_stations_norm == min(num_stations_norm))
# A tibble: 2 x 5
  route_name     n num_stations_norm late_night limited
  <chr>      <int>             <dbl>      <dbl>   <dbl>
1 G             25                21         NA      NA
2 Z             27                21         NA      NA

G and Z, both with 21 stations each, are next. I had initially wondered if maybe there are duplicates based on station lat/long pairs, and if I could get away with using the unique coordinates:

### G route fix ###
sub.line.tidy %>% 
  filter(route_name == "G")
# A tibble: 1 x 5
  primary_trunk_line color      hexadecimal lines route_name
  <chr>              <chr>      <chr>       <chr> <chr>     
1 IND Crosstown Line Lime green #6cbe45     G     G         
sub.stat.num.updt %>% 
  filter(route_name == "G") %>% 
  select(avg_stat_lat, avg_stat_long) %>% 
  distinct() %>% 
  nrow()
[1] 25

Unfortunately that did not turn out to be the case. From here, I manually checked through the station listings, comparing them to the offical list. The general process is that I would figure out which stations are currently stops of the particular route, then I remove all of the station stops for that particular route from the original df and add back in the correct station list, as follows below for the G train.

4.2.1 G train

Goal number of G train stations: 21

sub.stat.num.updt.G <- sub.stat.num.updt %>%
  filter(route_name != "G") %>% 
  bind_rows(
    sub.stat.num.updt %>% 
      filter(route_name == "G" & str_detect(stat_name, "ind") & stat_name != "ind_queens boulevard_23rd st-ely av")
    )
sub.stat.num.updt.G %>% 
  filter(route_name == "G") %>% 
  nrow()
[1] 21

4.2.2 Z train

Goal number of Z train stations: 21

### Z route fix ###
sub.line.tidy %>% 
  filter(route_name == "Z")
# A tibble: 1 x 5
  primary_trunk_line     color             hexadecimal lines route_name
  <chr>                  <chr>             <chr>       <chr> <chr>     
1 BMT Nassau Street Line Terra cotta brown #996633     J_Z   Z         
sub.stat.num.updt.Z <- sub.stat.num.updt.G %>% 
  filter(route_name != "Z") %>% 
  bind_rows(
    sub.stat.num.updt.G %>% 
      filter(route_name == "Z" & str_detect(stat_name, "bmt") & stat_name != "bmt_broadway_canal st (ul)") %>% 
      # add Z train to the broadway junction stop in Queens (was A/C/J/L)
      bind_rows(
        sub.stat.num.updt.G %>% 
          filter(str_detect(stat_name, "broadway junction")) %>% 
          mutate(route_name = "Z") %>% 
          distinct()
        ) %>% 
      # add Z train to the Alabama Ave stop in Queens (was J only)
      bind_rows(
        sub.stat.num.updt.G %>% 
          filter(str_detect(stat_name, "alabama")) %>% 
          mutate(route_name = "Z")
      ) %>% 
      # add back in the jamaica center and jfk airport stops that were filtered out earlier based on the "bmt" filter
      bind_rows(
        sub.stat.num.updt.G %>% 
          filter(route_name == "Z" & avg_stat_long > -73.82829)
        )
    )
sub.stat.num.updt.Z %>% 
  filter(route_name == "Z") %>% 
  nrow()
[1] 21
## Next fix targets ##
sub.stat.num.updt.Z %>% 
  group_by(route_name) %>% 
  count() %>% 
  ungroup() %>% 
  inner_join(num.stat.by.rt.wiki, by = "route_name") %>% 
  filter(n != num_stations_norm) %>% 
  filter(num_stations_norm == min(num_stations_norm))
# A tibble: 2 x 5
  route_name     n num_stations_norm late_night limited
  <chr>      <int>             <dbl>      <dbl>   <dbl>
1 7             31                22         NA      12
2 E             30                22         32      19

4.2.3 7 train

Goal number of 7 train stations: 22

sub.line.tidy %>% 
  filter(route_name == "7")
# A tibble: 1 x 5
  primary_trunk_line color     hexadecimal lines route_name
  <chr>              <chr>     <chr>       <chr> <chr>     
1 IRT Flushing Line  Raspberry #b933ad     7     7         
sub.stat.num.updt.7 <-  sub.stat.num.updt.Z %>%
  filter(route_name != "7") %>% 
  bind_rows(
    sub.stat.num.updt.Z %>% 
      filter(route_name == "7" & str_detect(stat_name, "irt")) %>% 
      # eliminate station copies at times square and grand central
      filter(!(
        stat_name %in% c(
          "irt_42nd st shuttle_times square", "irt_42nd st shuttle_grand central",
          "irt_lexington_grand central-42nd st"
          )
        )) %>% 
      # convert ADA = TRUE at the court sq station (was incorrectly FALSE)
      mutate(ada = ifelse(stat_name == "irt_flushing_45 rd-court house sq", TRUE, ada))
    )
sub.stat.num.updt.7 %>% 
  filter(route_name == "7") %>% 
  nrow()
[1] 22

Goal number of E train stations: 22

sub.line.tidy %>% 
  filter(route_name == "E")
# A tibble: 1 x 5
  primary_trunk_line     color      hexadecimal lines route_name
  <chr>                  <chr>      <chr>       <chr> <chr>     
1 IND Eighth Avenue Line Vivid blue #2850ad     A_C_E E         
sub.stat.num.updt.E <- sub.stat.num.updt.7 %>% 
  filter(route_name != "E") %>% 
  bind_rows(
    sub.stat.num.updt.7 %>% 
      filter(route_name == "E" & str_detect(stat_name, "ind") &  stat_name != "ind_8 avenue_chambers st") %>% 
      # ADA fix
      mutate(ada = ifelse(stat_name == "ind_archer av_jamaica-van wyck", TRUE, ada)) %>% 
      # E train to Briarwood station
      bind_rows(
        sub.stat.num.updt.7 %>% 
          filter(str_detect(stat_name, "briarwood")) %>% 
          mutate(route_name = "E")
        )
    )

## Next fix targets ##
sub.stat.num.updt.E %>% 
  group_by(route_name) %>% 
  count() %>% 
  ungroup() %>% 
  inner_join(num.stat.by.rt.wiki, by = "route_name") %>% 
  filter(n != num_stations_norm) %>% 
  filter(num_stations_norm == min(num_stations_norm))
# A tibble: 1 x 5
  route_name     n num_stations_norm late_night limited
  <chr>      <int>             <dbl>      <dbl>   <dbl>
1 L             30                24         NA      NA

Goal number of L train stations: 24

sub.line.tidy %>% 
  filter(route_name == "L")
# A tibble: 1 x 5
  primary_trunk_line color            hexadecimal lines route_name
  <chr>              <chr>            <chr>       <chr> <chr>     
1 BMT Canarsie Line  Light slate gray #a7a9ac     L     L         
sub.stat.num.updt.L <-  sub.stat.num.updt.E %>% 
  filter(route_name != "L") %>% 
  bind_rows(
    sub.stat.num.updt.E %>% 
      filter(route_name == "L" & str_detect(stat_name, "bmt") & stat_name != "bmt_broadway_union square") %>% 
      mutate(ada = ifelse(stat_name == "bmt_canarsie_wilson av", TRUE, ada)) %>% 
      # add L train to broadway junction
      bind_rows(
        sub.stat.num.updt.E %>% 
          filter(str_detect(stat_name, "broadway junction") & route_name == "L")
        )
    )
sub.stat.num.updt.L %>% 
  filter(route_name == "L") %>% 
  nrow()
[1] 24
## Next fix targets ##
sub.stat.num.updt.L %>% 
  group_by(route_name) %>% 
  count() %>% 
  ungroup() %>% 
  inner_join(num.stat.by.rt.wiki, by = "route_name") %>% 
  filter(n != num_stations_norm) %>% 
  filter(num_stations_norm == min(num_stations_norm))
# A tibble: 1 x 5
  route_name     n num_stations_norm late_night limited
  <chr>      <int>             <dbl>      <dbl>   <dbl>
1 B             53                27         NA      37

Goal number of B train stations: 37 (will include extra rush hour stations)

sub.line.tidy %>% 
  filter(route_name == "B")
# A tibble: 1 x 5
  primary_trunk_line    color         hexadecimal lines   route_name
  <chr>                 <chr>         <chr>       <chr>   <chr>     
1 IND Sixth Avenue Line Bright orange #ff6319     B_D_F_M B         
sub.stat.num.updt.B <- sub.stat.num.updt.L %>% 
  filter(route_name != "B") %>% 
  bind_rows(
    sub.stat.num.updt.L %>% 
      filter(route_name == "B" & (str_detect(stat_name, "bmt") | str_detect(stat_name, "ind"))) %>% 
      # convert non-ada stations to those that are ada = TRUE now
      mutate(ada = ifelse(
        stat_name %in% c(
          "bmt_brighton_kings highway", "ind_6 avenue_broadway-lafayette st", "ind_8 avenue_125th st"
          ),
        TRUE, ada)) %>% 
      # stations to exclude: 
      # atlantic ave /barclays duplicates and stops between barclays and brighton where B does not stop
      filter(!(avg_stat_lat > 40.60867 & avg_stat_lat < 40.63508)) %>% 
      filter(!(
        stat_name %in% c(
          "bmt_broadway_34th st", "bmt_brighton_parkside av", 
          "bmt_4 avenue_pacific st", "bmt_brighton_atlantic av", 
          "bmt_brighton_av u", "bmt_brighton_neck rd", 
          "bmt_brighton_beverly rd", "bmt_brighton_cortelyou rd"
          )
        ))
    )

## Next fix targets ##
sub.stat.num.updt.B %>% 
  group_by(route_name) %>% 
  count() %>% 
  ungroup() %>% 
  inner_join(num.stat.by.rt.wiki, by = "route_name") %>% 
  filter(n != num_stations_norm) %>%
  filter(route_name != "B") %>% 
  filter(num_stations_norm == min(num_stations_norm))
# A tibble: 1 x 5
  route_name     n num_stations_norm late_night limited
  <chr>      <int>             <dbl>      <dbl>   <dbl>
1 4             45                28         54      NA

Goal number of 4 train stations: 28 (need to exclude late night service)

sub.stat.num.updt.4 <- sub.stat.num.updt.B %>% 
  filter(route_name != "4") %>% 
  bind_rows(
    sub.stat.num.updt.B %>% 
      filter(route_name == "4" & str_detect(stat_name, "irt")) %>% 
      filter(!(
        stat_name %in% c(
          "irt_flushing_grand central-42nd st", "irt_42nd st shuttle_grand central", 
          "irt_clark_fulton st", "irt_clark_borough hall"
          )
        )) %>% 
      mutate(ada = ifelse(stat_name == "irt_lexington_fulton st", TRUE, ada))
    )

4.3 Switch to primary trunk lines

After fixing one or two per trunk line, I figured that the other train routes along that line would have similar problems for most stations (and similar fixes).

Goal number of J train stations: 30

sub.stat.num.updt.J <- sub.stat.num.updt.4 %>% 
  filter(route_name != "J") %>% 
  bind_rows(
    sub.stat.num.updt.4 %>% 
      filter(route_name == "J" & str_detect(stat_name, "bmt") & stat_name != "bmt_broadway_canal st (ul)") %>%
      # add back in jamaica center and the jfk airport stop
      bind_rows(
        sub.stat.num.updt.4 %>% 
          filter(route_name == "J" & avg_stat_long > -73.82829)
      ) %>%
      # add J train to broadway junction
      bind_rows(
        sub.stat.num.updt.4 %>% 
          filter(str_detect(stat_name, "broadway junction") & route_name == "J")
      ) %>% 
      mutate(ada = ifelse(stat_name == "bmt_nassau_fulton st", TRUE, ada))
    )
sub.stat.num.updt.J %>% 
  group_by(route_name) %>% 
  count() %>% 
  ungroup() %>% 
  inner_join(num.stat.by.rt.wiki, by = "route_name") %>% 
  filter(n != num_stations_norm) %>%
  filter(route_name != "B")
# A tibble: 13 x 5
   route_name     n num_stations_norm late_night limited
   <chr>      <int>             <dbl>      <dbl>   <dbl>
 1 1             45                38         NA      NA
 2 2             66                49         61      52
 3 3             51                34          9      NA
 4 5             62                45         NA      53
 5 6             48                38         NA      29
 6 A             60                44         66      NA
 7 C             52                40         NA      NA
 8 D             45                36         41      NA
 9 F             56                45         NA      NA
10 M             45                36          8      13
11 N             46                32         45      22
12 Q             49                29         34      NA
13 R             63                45         34      17

Goal number of A train stations: 44

sub.stat.num.updt.A <- sub.stat.num.updt.J %>% 
  filter(route_name != "A") %>% 
  bind_rows(
    sub.stat.num.updt.J %>% 
      filter(route_name == "A" & str_detect(stat_name, "ind")) %>%
      # remove stations where the A does not stop
      filter(!(
        stat_name %in% c(
          "ind_8 avenue_world trade center", "ind_8 avenue_broadway-nassau", 
          "ind_fulton_franklin av", "ind_fulton_kingston-throop",
          "ind_fulton_ralph av", "ind_fulton_rockaway av",
          "ind_fulton_liberty av", "ind_fulton_van siclen av", 
          "ind_fulton_shepherd av"
          )
        )) %>% 
      # add in the fulton st stop in manhattan
      bind_rows(
        sub.stat.num.updt.J %>% 
          filter(str_detect(stat_name, "fulton st") & route_name == "4") %>% 
          mutate(route_name = "A")
      ) %>% 
      mutate(ada = ifelse(
        stat_name %in% c(
          "ind_8 avenue_125th st", "ind_rockaway_far rockaway-mott av",
          "ind_rockaway_aqueduct racetrack", "ind_fulton_jay st - borough hall",
          "ind_fulton_utica av", "ind_liberty_lefferts blvd"
          ), 
        TRUE, ada
        )) %>% 
      # add in rockaway beach stops that A makes during rush hour
      bind_rows(
        sub.stat.num.updt.J %>% 
          filter(route_name == "H" & !(str_detect(stat_name, "broad channel"))) %>% 
          mutate(route_name = "A")
        )
    )

Goal number of C train stations: 40

sub.stat.num.updt.C <- sub.stat.num.updt.A %>% 
  filter(route_name != "C") %>% 
  bind_rows(
    sub.stat.num.updt.A %>% 
      filter(route_name == "C" & str_detect(stat_name, "ind")) %>% 
      filter(!(stat_name %in% c("ind_8 avenue_broadway-nassau", "ind_8 avenue_world trade center"))) %>% 
      bind_rows(
            sub.stat.num.updt.A %>% 
              filter(str_detect(stat_name, "fulton st") & route_name == "A") %>% 
              mutate(route_name = "C")
        ) %>%
      mutate(ada = ifelse(stat_name %in% c("ind_8 avenue_125th st", "ind_fulton_jay st - borough hall", "ind_fulton_utica av"), TRUE, ada))
)
sub.stat.num.updt.C %>% 
  filter(route_name == "C") %>% 
  nrow()
[1] 40

Goal number of 5 train stations: 45

sub.stat.num.updt.5 <- sub.stat.num.updt.C %>% 
  filter(route_name != "5") %>% 
  bind_rows(
    sub.stat.num.updt.C %>% 
      filter(route_name == "5" & str_detect(stat_name, "irt")) %>% 
      filter(!(
        stat_name %in% c(
          "irt_clark_borough hall", "irt_clark_fulton st", 
          "irt_flushing_grand central-42nd st", "irt_42nd st shuttle_grand central", 
          "irt_white plains road_wakefield-241st st"
          )
        )) %>%
      mutate(ada = ifelse(
        stat_name %in% c(
          "irt_lexington_fulton st", "irt_white plains road_east 180th st",
          "irt_white plains road_gun hill rd"
          ),
        TRUE, ada
        ))
    )
sub.stat.num.updt.5 %>% 
  filter(route_name == "5") %>% 
  nrow()
[1] 45

Goal number of 6 train stations: 38

sub.stat.num.updt.6 <- sub.stat.num.updt.5 %>% 
  filter(route_name != "6") %>% 
  bind_rows(
    sub.stat.num.updt.5 %>% 
      filter(route_name == "6" & str_detect(stat_name, "irt")) %>% 
      filter(!(stat_name %in% c("irt_flushing_grand central-42nd st", "irt_42nd st shuttle_grand central"))) %>% 
      mutate(ada = ifelse(stat_name %in% c("irt_lexington_bleecker st"), TRUE, ada))
    )
sub.stat.num.updt.6 %>% 
  filter(route_name == "6") %>% 
  nrow()
[1] 38
sub.stat.num.updt.6 %>% 
  group_by(route_name) %>% 
  count() %>% 
  ungroup() %>% 
  inner_join(num.stat.by.rt.wiki, by = "route_name") %>% 
  filter(n != num_stations_norm) %>%
  filter(route_name != "B")
# A tibble: 9 x 5
  route_name     n num_stations_norm late_night limited
  <chr>      <int>             <dbl>      <dbl>   <dbl>
1 1             45                38         NA      NA
2 2             66                49         61      52
3 3             51                34          9      NA
4 D             45                36         41      NA
5 F             56                45         NA      NA
6 M             45                36          8      13
7 N             46                32         45      22
8 Q             49                29         34      NA
9 R             63                45         34      17

D/F/M next:

Goal number of D train stations: 36

sub.stat.num.updt.D <- sub.stat.num.updt.6 %>% 
  filter(route_name != "D") %>% 
  bind_rows(
    sub.stat.num.updt.6 %>% 
      # the D train stops are a mix of ind and bmt 
      filter(route_name == "D" & (str_detect(stat_name, "ind") | str_detect(stat_name, "bmt"))) %>% 
      filter(!(
        stat_name %in% c(
          "bmt_broadway_34th st", "bmt_4 avenue_pacific st", 
          "bmt_brighton_atlantic av", "bmt_sea beach_new utrecht av", 
          "bmt_brighton_stillwell av")
        )) %>% 
      # add in the brooklyn 36th st stop:
      bind_rows(
        sub.stat.num.updt.6 %>% 
          filter(str_detect(stat_name, "4 avenue_36")) %>% 
          mutate(route_name = "D") %>% 
          distinct()
        ) %>% 
      mutate(ada = ifelse(
        stat_name %in% c("ind_8 avenue_125th st", "ind_6 avenue_broadway-lafayette st", "bmt_west end_bay parkway"), TRUE, ada)
        )
    )
# is the number of stations correct now?
sub.stat.num.updt.D %>% 
  filter(route_name == "D") %>% 
  nrow()
[1] 36

Goal number of F train stations: 45

sub.stat.num.updt.F <- sub.stat.num.updt.D %>% 
  filter(route_name != "F") %>% 
  bind_rows(
    sub.stat.num.updt.D %>% 
      filter(route_name == "F" & (str_detect(stat_name, "ind") | str_detect(stat_name, "bmt"))) %>% 
      filter(!(
        stat_name %in% c(
          "bmt_broadway_34th st", "bmt_canarsie_6th av", 
          "bmt_nassau_essex st", "bmt_broadway_lawrence st", 
          "bmt_4 avenue_9th st", "bmt_brighton_stillwell av", 
          "bmt_brighton_west 8th st"
          )
        )) %>% 
      mutate(ada = ifelse(stat_name %in% c("ind_6 avenue_broadway-lafayette st", "ind_fulton_jay st - borough hall"), TRUE, ada))
    )
sub.stat.num.updt.F %>% 
  filter(route_name == "F") %>% 
  nrow()
[1] 45

Goal number of M train stations: 36

sub.stat.num.updt.M <- sub.stat.num.updt.F %>% 
  filter(route_name != "M") %>% 
  bind_rows(
    sub.stat.num.updt.F %>% 
      filter(route_name == "M" & (str_detect(stat_name, "ind") | str_detect(stat_name, "bmt"))) %>% 
      filter(!(stat_name %in% c("bmt_broadway_34th st", "bmt_canarsie_6th av", "bmt_nassau_essex st"))) %>% 
      mutate(ada = ifelse(stat_name %in% c("ind_6 avenue_broadway-lafayette st"), TRUE, ada))
    )
# is the number of stations correct now?
sub.stat.num.updt.M %>% 
  filter(route_name == "M") %>% 
  nrow()
[1] 36
sub.stat.num.updt.M %>% 
  group_by(route_name) %>% 
  count() %>% 
  ungroup() %>% 
  inner_join(num.stat.by.rt.wiki, by = "route_name") %>% 
  filter(n != num_stations_norm) %>%
  filter(route_name != "B")
# A tibble: 6 x 5
  route_name     n num_stations_norm late_night limited
  <chr>      <int>             <dbl>      <dbl>   <dbl>
1 1             45                38         NA      NA
2 2             66                49         61      52
3 3             51                34          9      NA
4 N             46                32         45      22
5 Q             49                29         34      NA
6 R             63                45         34      17

1/2/3 next:

Goal number of 1 train stations: 38

sub.stat.num.updt.1 <- sub.stat.num.updt.M %>% 
  filter(route_name != "1") %>% 
  bind_rows(
    sub.stat.num.updt.M %>% 
      filter(route_name == "1" & str_detect(stat_name, "irt")) %>%
      filter(!(stat_name %in% c("irt_42nd st shuttle_times square", "irt_clark_park place"))) %>% 
      # add in the re-opened WTC Cortlandt station (doesn't exist in dataset, coord from wikipedia)
      bind_rows(
        tibble(
          stat_name = "irt_broadway-7th ave_wtc cortlandt", ada = TRUE, 
          avg_stat_lat = 40.7115, avg_stat_long = -74.012, route_name = "1")
        ) %>% 
      mutate(ada = ifelse(stat_name %in% c("irt_broadway-7th ave_dyckman st", "irt_broadway-7th ave_168th st"), TRUE, ada))
    )
sub.stat.num.updt.1 %>% 
  filter(route_name == "1") %>% 
  nrow()
[1] 38

Goal number of 2 train stations: 49

sub.stat.num.updt.2 <- sub.stat.num.updt.1 %>% 
  filter(route_name != "2") %>% 
  bind_rows(
    sub.stat.num.updt.1 %>% 
      filter(route_name == "2" & str_detect(stat_name, "irt")) %>% 
      filter(!(
        stat_name %in% c(
          "irt_42nd st shuttle_times square", "irt_lexington_fulton st",
          "irt_lexington_borough hall"
          )
        )) %>% 
      mutate(ada = ifelse(
        stat_name %in% c(
          "irt_white plains road_gun hill rd", "irt_white plains road_east 180th st",
          "irt_clark_fulton st"
          ),
        TRUE, ada
        ))
    )
sub.stat.num.updt.2 %>% 
  filter(route_name == "2") %>% 
  nrow()
[1] 49

Goal number of 3 train stations: 34

sub.stat.num.updt.3 <- sub.stat.num.updt.2 %>% 
  filter(route_name != "3") %>% 
  bind_rows(
    sub.stat.num.updt.2 %>% 
      filter(route_name == "3" & str_detect(stat_name, "irt")) %>% 
      filter(!(stat_name %in% c("irt_42nd st shuttle_times square", "irt_lexington_fulton st", "irt_lexington_borough hall"))) %>%
      mutate(ada = ifelse(stat_name %in% c("irt_clark_fulton st"), TRUE, ada))
    )
sub.stat.num.updt.3 %>% 
  filter(route_name == "3") %>% 
  nrow()
[1] 34
sub.stat.num.updt.3 %>% 
  group_by(route_name) %>% 
  count() %>% 
  ungroup() %>% 
  inner_join(num.stat.by.rt.wiki, by = "route_name") %>% 
  filter(n != num_stations_norm) %>%
  filter(route_name != "B")
# A tibble: 3 x 5
  route_name     n num_stations_norm late_night limited
  <chr>      <int>             <dbl>      <dbl>   <dbl>
1 N             46                32         45      22
2 Q             49                29         34      NA
3 R             63                45         34      17

Now for the N/Q/R/W route updates that I’ve been avoiding:

Goal number of R train stations: 45

sub.stat.num.updt.R <- sub.stat.num.updt.3 %>% 
  filter(route_name != "R") %>% 
  bind_rows(
    sub.stat.num.updt.3 %>% 
      filter(route_name == "R" & (str_detect(stat_name, "bmt") | str_detect(stat_name, "ind"))) %>% 
      filter(!(stat_name %in% c(
        "ind_6 avenue_smith-9th st", "bmt_4 avenue_pacific st", 
        "bmt_brighton_atlantic av", "ind_fulton_jay st - borough hall", 
        "bmt_nassau_canal st", "bmt_canarsie_union square", 
        "ind_6 avenue_34th st", "ind_8 avenue_42nd st"
        ))) %>% 
      mutate(ada = ifelse(stat_name %in% c("bmt_broadway_lawrence st", "bmt_broadway_cortlandt st"), TRUE, ada))
    )
sub.stat.num.updt.R %>% 
  filter(route_name == "R") %>% 
  nrow()
[1] 45

Goal number of N train stations: 32

sub.stat.num.updt.N <- sub.stat.num.updt.R %>% 
  filter(route_name != "N") %>% 
  bind_rows(
    sub.stat.num.updt.R %>% 
      filter(route_name == "N" & str_detect(stat_name, "bmt")) %>% 
      filter(!(stat_name %in% c(
        "bmt_brighton_stillwell av", "bmt_west end_62nd st", 
        "bmt_brighton_atlantic av", "bmt_4 avenue_pacific st", 
        "bmt_nassau_canal st", "bmt_canarsie_union square"
        ))) %>% 
      # add back in queensboro plaza, only route that has irt division instead of bmt
      bind_rows(sub.stat.num.updt.R %>% filter(route_name == "N" & str_detect(stat_name, "queensboro")))
    )
sub.stat.num.updt.N %>% 
  filter(route_name == "N") %>% 
  nrow()
[1] 32

The W train was introduced to replace the Q train in Astoria when the Q was rerouted up 2nd Ave in Manhattan from its original route in Queens. Unsurprisingly, since the subway entrances/exits dataset contains the old Q train route information, the W is also not listed. Luckily, the W route is a mashup of the R and N routes, so those route sections can simply be stitched together to create the W.

Goal number of W train stations: 23

num.stat.by.rt.wiki %>% 
  filter(route_name == "W")
# A tibble: 1 x 4
  route_name num_stations_norm late_night limited
  <chr>                  <dbl>      <dbl>   <dbl>
1 W                         23         44      NA
sub.stat.num.updt.W <- sub.stat.num.updt.N %>% 
  bind_rows(
    sub.stat.num.updt.N %>% 
      filter(route_name == "N" & avg_stat_lat > 40.68367) %>% 
      bind_rows(
        sub.stat.num.updt.N %>% 
          filter(route_name == "R") %>% 
          filter(avg_stat_lat > 40.69410 & avg_stat_lat < 40.71952)
        ) %>% 
      mutate(route_name = "W")
    )
sub.stat.num.updt.W %>% 
  filter(route_name == "W") %>% 
  nrow()
[1] 23

Last, but not least, the Q train update, which included removing the old stations stops in Queens, adding in the 2nd Ave stops that did not exist in the dataset, and removing some station stops in Brooklyn.

Goal number of Q train stations: 29

sub.stat.final <- sub.stat.num.updt.W %>% 
  filter(route_name != "Q") %>% 
  bind_rows(
    sub.stat.num.updt.W %>% 
      filter(route_name == "Q" & !str_detect(stat_name, "astoria") & str_detect(stat_name, "bmt")) %>%
      # add in 3 new 2nd Ave stops at 72nd St, 86th St, and 96th St
      bind_rows(
        tibble(
          stat_name = c("ind_2 avenue_72nd st", "ind_2 avenue_86th st", "ind_2 avenue_96th st"), 
          # all are ADA = TRUE
          ada = rep(TRUE, 3), 
          # lat and long from wikipedia pages for eachs tation
          avg_stat_lat = c(40.768889, 40.777861, 40.7841), 
          avg_stat_long = c(-73.958333, -73.95175, -73.9472),
          # only the Q stops at these stations on a regular schedule
          route_name = rep("Q", 3)
          )
        ) %>% 
      # add in the 63rd St, where only the F used to stop, but now the Q stops there
      bind_rows(
        sub.stat.num.updt.W %>% 
          filter(stat_name == "ind_63rd street_lexington av") %>% 
          mutate(route_name = "Q")
      ) %>% 
      filter(!(stat_name %in% c(
        "bmt_broadway_5th av", "bmt_broadway_lexington av", 
        "bmt_broadway_49th st", "bmt_canarsie_union square", 
        "bmt_nassau_canal st", "bmt_4 avenue_pacific st", 
        "bmt_brighton_atlantic av", "bmt_coney island_stillwell av", 
        "bmt_coney island_west 8th st"
        ))) %>%
      mutate(ada = ifelse(stat_name %in% c("bmt_brighton_av h", "bmt_brighton_kings highway"), TRUE, ada))
    )
sub.stat.final %>% 
  filter(route_name == "Q") %>% 
  nrow()
[1] 29
sub.stat.final %>% 
  group_by(route_name) %>% 
  count() %>% 
  ungroup() %>% 
  inner_join(num.stat.by.rt.wiki, by = "route_name") %>% 
  filter(n != num_stations_norm) %>%
  filter(route_name != "B")
# A tibble: 0 x 5
# ... with 5 variables: route_name <chr>, n <int>,
#   num_stations_norm <dbl>, late_night <dbl>, limited <dbl>

At last! The subway station route counts match.

nrow(sub.stat.num.updt)
[1] 978
nrow(sub.stat.final)
[1] 750

The final row count, after removing all of those extra stations, is 750.

Now to remove all of the extra dataframes:

rm(list = ls(pattern = "sub.stat.num.updt."))

5 Analysis and Visualization

Now that the subway station dataset is relatively clean and tidy, it’s possible to finally move on to do some geospatial analysis with it and the NYC shapefiles. Before that, although the station dataset includes the station spatial coordinates, it is not a spatial object in the same sense that the map shapefiles are. The latitude and longitude will have to be interpreted through a coordinate reference system. Earlier, I had changed the reference system of one maps, but here I will turn the entire subway dataset into a sf spatial object for further analysis. This is done by first converting it into a spatial object, which needs the correct crs information. Then, the crs of the new spatial object can be converted to match the NYC map shapefile units.

sub.cln.sf <- st_as_sf(sub.stat.final, coords = c("avg_stat_long", "avg_stat_lat"), crs = "+init=epsg:4326")
head(sub.cln.sf)
Simple feature collection with 6 features and 3 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: -73.97919 ymin: 40.58321 xmax: -73.81364 ymax: 40.75277
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
# A tibble: 6 x 4
  stat_name                         ada   route_name             geometry
  <chr>                             <lgl> <chr>              <POINT [°]>
1 irt_42nd st shuttle_grand central FALSE GS         (-73.97919 40.75277)
2 bmt_franklin_botanic gardens      FALSE FS         (-73.95924 40.67034)
3 bmt_franklin_park place           TRUE  FS         (-73.95762 40.67477)
4 ind_rockaway_beach 105th st       FALSE H          (-73.82756 40.58321)
5 ind_rockaway_beach 90th st        FALSE H          (-73.81364 40.58803)
6 ind_rockaway_beach 98th st        FALSE H          (-73.82056 40.58531)
sub.sf.nyc.crs <- st_transform(sub.cln.sf, crs = st_crs(nyc.census.4boro))
head(sub.sf.nyc.crs)
Simple feature collection with 6 features and 3 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 990015.9 ymin: 151802.3 xmax: 1036011 ymax: 213531.4
epsg (SRID):    NA
proj4string:    +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
# A tibble: 6 x 4
  stat_name                       ada   route_name                 geometry
  <chr>                           <lgl> <chr>      <POINT [US_survey_foot]>
1 irt_42nd st shuttle_grand cent~ FALSE GS              (990015.9 213531.4)
2 bmt_franklin_botanic gardens    FALSE FS              (995555.6 183503.1)
3 bmt_franklin_park place         TRUE  FS              (996004.5 185116.9)
4 ind_rockaway_beach 105th st     FALSE H                (1032148 151802.3)
5 ind_rockaway_beach 90th st      FALSE H                (1036011 153568.1)
6 ind_rockaway_beach 98th st      FALSE H                (1034091 152570.6)

The subway station stop dataset now has a geometry column with in US foot units and can be combined and layered with the map shapefiles for visualization and spatial analysis purposes.

Here are the station stops visualized over the NYC neighborhoods map, colored according to their traditional service line colors:

nyc.neigh.4boro %>% 
  ggplot() +
  # plot the NYC neighborhoods map
  geom_sf() +
  # layer the station stops:
  geom_sf(
    data = sub.sf.nyc.crs %>% 
      mutate(route_name = ifelse(route_name %in% c("FS", "GS", "H"), "S", route_name)) %>% 
      left_join(sub.line.tidy, by = "route_name"), 
    size = 1.5, aes(color = primary_trunk_line), alpha = 0.8
    ) +
  scale_color_manual(
    # color according to trunk line
    values = c("#fccc0a", "#a7a9ac", "#996633", "#6cbe45", "#2850ad", "#ff6319", "#ee352e", "#b933ad", "#00933c", "#808183"),
    name = "Subway Trunk Line"
    )

5.1 By Route

Now that the dataset is in good shape, it’s time to actually dig into it, starting with looking at how many station stops (ADA-accessible and not) are along each route:

stat.by.route <- sub.stat.final %>% 
  group_by(route_name) %>% 
  summarize(
    # total number of stations by route:
    total_stat = n(),
    # ada is TRUE/FALSE - use this to get total number of ada stations by route:
    num_ada = sum(ada),
    # percent of total num of station stops that are ada, by route:
    per_ada = num_ada * 100 / total_stat
  ) %>% 
  # convert the 3 shuttles into route_name == "S", to match other datasets
  mutate(
    rt_mod = ifelse(
      route_name == "GS" | route_name == "FS" | route_name == "H",
      "S",
      route_name
      )
    ) %>% 
  # join to get subway route groupings, colors
  left_join(sub.line.tidy, by = c("rt_mod" = "route_name")) %>% 
  arrange(primary_trunk_line, route_name)
# create factor for nice route order in plots
stat.by.route$rt_order <- factor(stat.by.route$route_name, levels = stat.by.route$route_name)
# rearrange to match up colors with routes:
subway.by.line <- subway.by.line %>% 
  arrange(primary_trunk_line)

Everything is prepped, so let’s plot, starting with a bar graph of the total number and the number of ADA accessible subway station stops by route. The bars will be colored by their traditional NYC subway colors:

# average number of subway stations per route
mean(stat.by.route$total_stat)
[1] 30
stat.by.route %>% 
  ggplot(aes(rt_order, total_stat, fill = primary_trunk_line)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = subway.by.line$hexadecimal, name = "Line") +
  ylab("Total Number of Stations") +
  xlab("Subway Route")

# average number of ada-accessible stations per route:
mean(stat.by.route$num_ada)
[1] 9.12
stat.by.route %>% 
  ggplot(aes(rt_order, num_ada, fill = primary_trunk_line)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = subway.by.line$hexadecimal, name = "Line") +
  ylab("Number of ADA Stations") +
  xlab("Subway Route")

The total number of station stops varies quite a bit between the same line routes because some are local and others are express in stretches. The shuttles have the least number of station stops, they are relatively short. The L, G, and 7 routes are they only routes in their line groupings, but they still get their own colors.

The number of ADA-accessible stations by route is far less than the total number of stations. By itself, this is not very informative, but let’s take a look at the percent of subway stations that are ADA-accessible per route:

stat.by.route %>% 
  ggplot(aes(rt_order, per_ada, fill = primary_trunk_line)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = subway.by.line$hexadecimal, name = "Line") +
  ylab("Percent of Stations that are ADA") +
  xlab("Subway Route")

nrow(stat.by.route)
[1] 25
stat.by.route %>% 
  filter(per_ada < 25) %>% 
  select(route_name:per_ada)
# A tibble: 10 x 4
   route_name total_stat num_ada per_ada
   <chr>           <int>   <int>   <dbl>
 1 N                  32       6   18.8 
 2 R                  45      11   24.4 
 3 W                  23       5   21.7 
 4 L                  24       5   20.8 
 5 J                  30       5   16.7 
 6 Z                  21       3   14.3 
 7 G                  21       1    4.76
 8 6                  38       9   23.7 
 9 GS                  2       0    0   
10 H                   5       1   20   
stat.by.route %>% 
  filter(per_ada > 50) %>% 
  select(route_name:per_ada)
# A tibble: 2 x 4
  route_name total_stat num_ada per_ada
  <chr>           <int>   <int>   <dbl>
1 E                  22      14    63.6
2 FS                  4       3    75  

10 out of the 25 total routes have fewer than 25% ada-accessible station stops and only 2 trains, one of which is a shuttle train with 4 total stops, make it above 50% accessibility.

what is the relationship between the total number of subway station stops per route with the proprotion of stations that are ADA-accessible, imaginably that would have an impact.

stat.by.route %>% 
  ggplot(aes(total_stat, num_ada, color = primary_trunk_line)) +
  geom_abline(slope = 1, intercept = 0, alpha = 0.5, size = 1.5) +
  geom_point(size = 3) +
  scale_color_manual(values = subway.by.line$hexadecimal, name = "Line") +
  xlab("Total Number of Stations") +
  ylab("Number of ADA Stations")

Superficially, the number of ADA-accessible stations does increase with the number of total station stops per route, however that increase is far less than unity, which is represented by the grey line in the plot above.

What about the relationship between the percent of stations that are ada vs total number of stations, perhaps with train routes that make more stops it would be more expensive to have more ada-accessible stations.

stat.by.route %>% 
  ggplot(aes(total_stat, per_ada, color = primary_trunk_line, label = route_name)) +
  geom_point(size = 3) +
  scale_color_manual(values = subway.by.line$hexadecimal, name = "Line") +
  geom_text(hjust = 0, vjust = 0, check_overlap = TRUE, nudge_y = 1.5) +
  xlab("Total Number of Stations") +
  ylab("Percent of Stations that are ADA")

The result is something of a funnel shape, and there’s not an obvious relationship.

Another way visualization method that could be used to compare the total number of stations and the number that are ada accessible is a barbell plot. First, the dataframe is reorders by the total number of stations per route, in order to make lines with a similar number of stops easier to compare. Then the total number of stations can be plotted in blue, the number of ada-accessible stations in orange, and a line segment can be drawn to connect the two points. The longer the line segment, the greater the difference between the two numbers.

stat.by.route <- stat.by.route %>% 
  arrange(total_stat, num_ada) %>% 
  mutate(total_stat_order = factor(route_name, levels = route_name))
stat.by.route %>% 
  ggplot(aes(total_stat_order, total_stat)) +
  geom_segment(aes(x = total_stat_order, xend = total_stat_order, y = num_ada, yend = total_stat), size = 1, alpha= 0.5, color = "gray60") +
  # total number of station stops on route in orange
  geom_point(size = 3, color = "#fc8d59") +
  # number of ada-accessible station stops on route in blue
  geom_point(aes(total_stat_order, num_ada), size = 3, color = "#91bfdb") +
  xlab("Route Name") +
  ylab("Number of Stations") +
  ggtitle("Total number of stations (orange) vs number of ADA-accessible stations (blue)") +
  coord_flip()

This can be another handy way to notice a few things, such as that the E, 7, Z, and G all have about the same number of total stations, but a very different number of ADA-accessible stations.

5.2 By Borough

Moving on to the geospatial side of things, let’s first explore and compare subway accessibility across the 4 boroughs: Bronx, Brooklyn, Manhattan, and Queens. This is where the sf package starts to become really useful. Earlier, I had created the object sub.sf.nyc.crs which contains the subway station data and looks like this:

head(sub.sf.nyc.crs)
Simple feature collection with 6 features and 3 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 990015.9 ymin: 151802.3 xmax: 1036011 ymax: 213531.4
epsg (SRID):    NA
proj4string:    +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
# A tibble: 6 x 4
  stat_name                       ada   route_name                 geometry
  <chr>                           <lgl> <chr>      <POINT [US_survey_foot]>
1 irt_42nd st shuttle_grand cent~ FALSE GS              (990015.9 213531.4)
2 bmt_franklin_botanic gardens    FALSE FS              (995555.6 183503.1)
3 bmt_franklin_park place         TRUE  FS              (996004.5 185116.9)
4 ind_rockaway_beach 105th st     FALSE H                (1032148 151802.3)
5 ind_rockaway_beach 90th st      FALSE H                (1036011 153568.1)
6 ind_rockaway_beach 98th st      FALSE H                (1034091 152570.6)

What is missing from this dataset is a borough assignment. While one option might be to find another dataset that matches subway stations to borough, or to try and create complex rules to assign stations a borough by latitude and longitude, instead we can take advantage of the spatial join st_join function from the sf package. Because subway dataset and the NYC maps now have the same CRS, a spatial join can be used to merge the data based on geometry.

sub.boro <- st_join(sub.sf.nyc.crs, nyc.census.4boro)
head(sub.boro)
Simple feature collection with 6 features and 14 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 990015.9 ymin: 151802.3 xmax: 1036011 ymax: 213531.4
epsg (SRID):    NA
proj4string:    +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
                          stat_name   ada route_name ctlabel borocode
1 irt_42nd st shuttle_grand central FALSE         GS      94        1
2      bmt_franklin_botanic gardens FALSE         FS     213        3
3           bmt_franklin_park place  TRUE         FS     305        3
4       ind_rockaway_beach 105th st FALSE          H     938        4
5        ind_rockaway_beach 90th st FALSE          H  942.02        4
6        ind_rockaway_beach 98th st FALSE          H  942.01        4
   boroname ct2010 boroct2010 cdeligibil ntacode
1 manhattan 009400    1009400          I    MN17
2  brooklyn 021300    3021300          E    BK63
3  brooklyn 030500    3030500          E    BK61
4    queens 093800    4093800          E    QN10
5    queens 094202    4094202          E    QN12
6    queens 094201    4094201          E    QN10
                                                ntaname puma shape_leng
1                                 Midtown-Midtown South 3807   5738.940
2                                   Crown Heights South 4011   7277.708
3                                   Crown Heights North 4006   7369.206
4 Breezy Point-Belle Harbor-Rockaway Park-Broad Channel 4114  13397.531
5                              Hammels-Arverne-Edgemere 4114  21766.241
6 Breezy Point-Belle Harbor-Rockaway Park-Broad Channel 4114   8656.589
  shape_area                  geometry
1    1646379 POINT (990015.9 213531.4)
2    1989395 POINT (995555.6 183503.1)
3    3474329 POINT (996004.5 185116.9)
4    8701238  POINT (1032148 151802.3)
5    7609021  POINT (1036011 153568.1)
6    3813369  POINT (1034091 152570.6)

All of the information that was in the NYC census shapefile has been merged with the subway station stops data. For now, since the focus of this section is only on the borough summary data, we’ll remove the geometry information and columns relating to neighborhoods and census tracts:

Population estimate count: https://en.wikipedia.org/wiki/Demographics_of_New_York_City

sub.boro.nogeo <- sub.boro %>% 
  # get rid of geometry, turn back to just a dataframe:
  st_set_geometry(NULL) %>% 
  as_tibble() %>% 
  select(stat_name:route_name, boroname:boroct2010, ntaname, shape_area)
# preview of new df:
head(sub.boro.nogeo)
# A tibble: 6 x 8
  stat_name  ada   route_name boroname ct2010 boroct2010 ntaname shape_area
  <chr>      <lgl> <chr>      <chr>    <fct>  <fct>      <fct>        <dbl>
1 irt_42nd ~ FALSE GS         manhatt~ 009400 1009400    Midtow~   1646379.
2 bmt_frank~ FALSE FS         brooklyn 021300 3021300    Crown ~   1989395.
3 bmt_frank~ TRUE  FS         brooklyn 030500 3030500    Crown ~   3474329.
4 ind_rocka~ FALSE H          queens   093800 4093800    Breezy~   8701238.
5 ind_rocka~ FALSE H          queens   094202 4094202    Hammel~   7609021.
6 ind_rocka~ FALSE H          queens   094201 4094201    Breezy~   3813369.
# group by and summrize station counts by borough:
boro.stat.count <- sub.boro.nogeo %>%
  group_by(boroname) %>% 
  summarize(
    # total number of stations
    tot_stat = n(),
    # ada station count
    ada_stat = sum(ada),
    not_ada = tot_stat - ada_stat,
    # percet of stations that are ada:
    ada_percent = ada_stat * 100 / tot_stat
    ) %>% 
  arrange(desc(tot_stat)) %>% 
  mutate(tot_plot_ord = factor(boroname, levels = boroname, labels = c("Manhattan", "Brooklyn", "Queens", "Bronx")))
boro.stat.count %>% 
  ggplot(aes(tot_plot_ord, tot_stat, fill = tot_plot_ord)) + 
  geom_bar(stat= "identity") +
  scale_fill_tableau() +
  xlab("Borough Name") +
  ylab("Total Number of Stations") +
  theme(legend.position = "none")

boro.pop.est2017 <- tibble(
  boroname = c("bronx", "brooklyn", "manhattan", "queens"),
  # from wikipedia, in millions
  pop_est = c(1.47, 2.65, 1.66, 2.36)
  )
# quick comparison: 
boro.stat.count %>% 
  inner_join(boro.pop.est2017, by = "boroname") %>% 
  select(boroname:tot_stat, pop_est)
# A tibble: 4 x 3
  boroname  tot_stat pop_est
  <chr>        <int>   <dbl>
1 manhattan      282    1.66
2 brooklyn       244    2.65
3 queens         125    2.36
4 bronx           99    1.47

Manhattan has the highest number of subway stations, followed by Brooklyn. This may be the case because, even though more people live in Brooklyn and Queens, Manhattan is to some extent, the economic and commercial hub of the city, where people many people work.

boro.stat.count %>% 
  select(-c(tot_stat, ada_percent)) %>% 
  gather("stat_type", "num_stat", ada_stat:not_ada) %>% 
  mutate(stat_type = ifelse(stat_type == "ada_stat", "ADA Station", "Not ADA")) %>% 
  ggplot(aes(tot_plot_ord, num_stat, fill = stat_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_tableau(name = "Station Type") +
  xlab("Borough Name") +
  ylab("Number of Subway Stations")

boro.stat.count %>% 
  select(-c(tot_stat, ada_percent)) %>% 
  gather("stat_type", "num_stat", ada_stat:not_ada) %>% 
  mutate(stat_type = ifelse(stat_type == "ada_stat", "ADA Station", "Not ADA")) %>% 
  ggplot(aes(tot_plot_ord, num_stat, fill = stat_type)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_fill_tableau(name = "Station Type") +
  xlab("Borough Name") +
  ylab("Proportion")

Manhattan has by far, the highest number of accessible subway route stations. One reason for this is the way that I have chosen to count subway stations. By counting each route stop individually, even if they stop at the same physical subway station, the number becomes very high because all subway station lines, with the exception of 2 shuttles and the G, pass through Manhattan and make many stops, typically going along the length of the island and making may stops. The proportion of ADA-accessible stations is higher in Manhattan than in the other 3 boroughs as well.

5.3 Analysis by Neighborhood

Now to zoom in a bit more in terms of spatial scale, let’s look at subway stations by neighborhood. I will use the same dataframe as the one used for the borough comparison, but now I will group by neighborhood and calculate the same summary statistics as before. I also group by borough, which does not add anything since the neighborhood names (in the ntanames column) are unique, but it is an easy way to carry along the neighborhood borough names for analysis down the line.

neigh.stat.count <- sub.boro.nogeo %>% 
  # grouping by borough allows for that column to be carried along:
  group_by(boroname, ntaname) %>% 
  summarize(
    tot_stat = n(),
    ada_stat = sum(ada),
    not_ada = tot_stat - ada_stat,
    ada_percent = ada_stat * 100 / tot_stat
    ) %>% 
  # plot order column, same order as for the borough analysis (total stations)
  mutate(boro_order = factor(boroname, levels = boro.stat.count$boroname, labels = c("Manhattan", "Brooklyn", "Queens", "Bronx"))) %>% 
  ungroup()
neigh.stat.count %>% 
  group_by(boro_order) %>% 
  summarize(
    mean_sub_stat = mean(tot_stat),
    med_sub_stat = median(tot_stat),
    mean_ada_stat = mean(ada_stat),
    med_ada_stat = median(ada_stat)
    )
# A tibble: 4 x 5
  boro_order mean_sub_stat med_sub_stat mean_ada_stat med_ada_stat
  <fct>              <dbl>        <dbl>         <dbl>        <dbl>
1 Manhattan          10.1           6            4.21            2
2 Brooklyn            5.42          4            1.22            0
3 Queens              5.21          4.5          1.42            1
4 Bronx               3.54          3            0.75            0
neigh.stat.count %>% 
  ggplot(aes(boro_order, tot_stat, color = boro_order)) +
  geom_jitter(size = 2, alpha = 0.8, width = 0.25) +
  scale_color_tableau() +
  xlab("Borough") +
  ylab("Total Number of Stations per Neighborhood") +
  theme(legend.position = "none")

neigh.stat.count %>% 
  ggplot(aes(boro_order, ada_stat, color = boro_order)) +
  geom_jitter(size = 2, alpha = 0.8, width = 0.25) +
  scale_color_tableau() +
  xlab("Borough") +
  ylab("Number of ADA Subway Stations per Neighborhood") +
  theme(legend.position = "none")

data_summary <- function(x) {
   m <- mean(x)
   ymin <- ifelse(m - sd(x) > 0, m - sd(x), 0)
   ymax <- m + sd(x)
   return(c(y = m, ymin = ymin, ymax = ymax))
}
neigh.stat.count %>% 
  ggplot(aes(boro_order, tot_stat, color = boro_order, fill = boro_order)) +
  geom_violin() +
  stat_summary(fun.data = data_summary, color = "black", alpha = 0.8) +
  scale_color_tableau() +
  scale_fill_tableau() +
  xlab("Borough") +
  ylab("Total Number of Stations per Neighborhood") +
  theme(legend.position = "none")

neigh.stat.count %>% 
  ggplot(aes(boro_order, ada_stat, color = boro_order, fill = boro_order)) +
  geom_violin() +
  stat_summary(fun.data = data_summary, color = "black", alpha = 0.8) +
  scale_color_tableau() +
  scale_fill_tableau() +
  xlab("Borough") +
  ylab("Number of ADA Subway Stations per Neighborhood")

neigh.stat.count %>% 
  ggplot(aes(x = boro_order, y = ada_percent, color = boro_order)) +
  geom_jitter(size = 2, alpha = 0.8, width = 0.25) +
  ylab("Percent ADA by Neighborhood") +
  xlab("Borough Name") +
  scale_color_tableau() +
  theme(legend.position = "none")

neigh.stat.count %>% 
  ggplot(aes(boro_order, ada_percent, color = boro_order, fill = boro_order)) +
  geom_violin() +
  stat_summary(fun.data = data_summary, color = "black", alpha = 0.8) +
  scale_color_tableau() +
  scale_fill_tableau() +
  xlab("Borough") +
  ylab("Percent ADA by Neighborhood")

neigh.stat.count %>% 
  select(-c(not_ada:boro_order)) %>% 
  gather(key = "count_type", value = "n", tot_stat:ada_stat) %>% 
  ggplot(aes(n, fill = count_type)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ count_type) +
  scale_fill_tableau()

# dist skewed, maybe to do with land area:
nyc.neigh.4boro %>% 
  filter(!(str_detect(ntaname, "park") | str_detect(ntaname, "Airport"))) %>% 
  ggplot(aes(shape_area)) +
  geom_histogram(bins = 30)

neigh.stat.count %>% 
  inner_join(nyc.neigh.4boro, by = "ntaname") %>% 
  select(boro_order, ntaname:ada_percent, shape_area) %>%
  mutate(
    area_mi = shape_area / 5280**2,
    `Total Stations` = tot_stat / area_mi,
    `ADA Stations` = ada_stat / area_mi
    ) %>% 
  select(boro_order, `Total Stations`:`ADA Stations`) %>% 
  gather("stat_type", "per_sqmi", `Total Stations`:`ADA Stations`) %>% 
  ggplot(aes(boro_order, per_sqmi, color = boro_order)) +
  geom_jitter(size = 2, alpha = 0.8, width = 0.25) +
  scale_color_tableau() +
  facet_wrap(~ stat_type, ncol = 1, scales = "free_y") +
  theme(legend.position = "none") +
  xlab("Borough") +
  ylab("Number of stations per sq mile")

Now to tackle the original problem, to recreate the NYC neighborhoods map, where each neighborhood is filled based on whether or not there is an accessible station in the neighborhood.

nyc.neigh.4boro.count <- nyc.neigh.4boro %>% 
  full_join(
    neigh.stat.count %>% 
      select(-boroname),
    by = "ntaname"
    )
nyc.neigh.4boro.count %>% 
  ggplot() +
  geom_sf(aes(fill = tot_stat)) +
  scale_fill_viridis_c()

nyc.census.4boro %>% 
  filter(str_detect(ntaname, "park") | str_detect(ntaname, "Airport")) %>% 
  ggplot() + 
  geom_sf()

nyc.neigh.count.prep <- nyc.neigh.4boro.count %>% 
  mutate(
    color_group = ifelse(
      str_detect(ntaname, "park"), "Park", ifelse(
        str_detect(ntaname, "Airport"), "Airport", ifelse(
          !(str_detect(ntaname, "park") | str_detect(ntaname, "Airport")) & is.na(tot_stat), "No stations", ifelse(
            !(str_detect(ntaname, "park") | str_detect(ntaname, "Airport")) & tot_stat > 0 & ada_stat < 1, "No accessible stations", ifelse(
              !(str_detect(ntaname, "park") | str_detect(ntaname, "Airport")) & tot_stat > 0 & ada_stat > 0, "At least one accessible station", "Don't know"
              )
            )
          )
        )
      )
    )
table(nyc.neigh.count.prep$color_group)

                        Airport At least one accessible station 
                              1                              58 
         No accessible stations                     No stations 
                             63                              50 
                           Park 
                              4 
nyc.neigh.count.prep %>% 
  ggplot() +
  geom_sf(aes(fill = color_group), color = "#BFBFBF") +
  scale_fill_manual(values = c("grey30", "#0072B2", "#E69F00", "#999999", "#bae4b3"), name = NULL) +
  geom_sf(data = sub.sf.nyc.crs, color = "#BFBFBF", fill = "black", size = 1.5, pch = 21) +
  theme(axis.text.x = element_blank(), axis.text.y = element_blank())

nyc.neigh.count.prep %>% 
  select(boroname, ntaname, ntacode, color_group, tot_stat:ada_percent) %>% 
  mapview(zcol = "color_group", col.regions = c("grey30", "#0072B2", "#E69F00", "#999999", "#bae4b3"))

An alternative method to blocking out different regions is to simply create separate objects for them, such as how I block out the parks below:

nyc.parks.air.sf <- nyc.census.4boro %>% 
  filter(str_detect(ntaname, "park") | str_detect(ntaname, "Airport"))
nyc.neigh.count.prep %>%
  ggplot() +
  geom_sf(aes(fill = tot_stat)) +
  scale_fill_viridis_c(option = "D", name = "Total Stations") +
  geom_sf(data = nyc.parks.air.sf, fill =  "grey30") +
  theme(axis.text.x = element_blank(), axis.text.y = element_blank())

nyc.neigh.count.prep %>% 
  mutate(ada_stat = ifelse(ada_stat == 0, NA, ada_stat)) %>% 
  ggplot() +
  geom_sf(aes(fill = ada_stat)) +
  scale_fill_viridis_c(option = "D", name = "ADA stations") +
  geom_sf(data = nyc.parks.air.sf, fill =  "grey30") +
  theme(axis.text.x = element_blank(), axis.text.y = element_blank())

nyc.neigh.notada <- nyc.neigh.count.prep %>% 
  filter(ada_stat == 0)
nyc.neigh.count.prep <- nyc.neigh.count.prep %>% 
  mutate(ada_percent = ifelse(ada_percent == 0, NA, ada_percent))
nyc.neigh.count.prep$ada_per_breaks <- as.character(cut(nyc.neigh.count.prep$ada_percent, breaks = seq(0, 100, 20)))
nyc.neigh.count.prep %>% 
  # insert the neighborhoods with subway stations (not ADA) into the fill code:
  mutate(
    ada_per_breaks = ifelse(
      ntacode %in% nyc.parks.air.sf$ntacode, "Park / Airport",
      ifelse(
        ntacode %in% nyc.neigh.notada$ntacode, "No accessible stations", ada_per_breaks
        )
      )
    ) %>% 
  ggplot() +
  geom_sf(aes(fill = ada_per_breaks)) +
  scale_fill_manual(values = c("#c7e9b4", "#7fcdbb", "#41b6c4", "#2c7fb8", "#253494", "#E69F00", "grey30"), na.value = "gray50", name = "Percent ADA") + 
  geom_sf(data = sub.sf.nyc.crs, color = "#BFBFBF", fill = "black", size = 1.5, pch = 21) +
  theme(axis.text.x = element_blank(), axis.text.y = element_blank())

nyc.neigh.count.prep %>% 
  #select(boroname, ntaname, ntacode, color_group, tot_stat:ada_percent) %>% 
  mutate(
    ada_per_breaks = ifelse(
      ntacode %in% nyc.parks.air.sf$ntacode, "Park / Airport",
      ifelse(
        ntacode %in% nyc.neigh.notada$ntacode, "No accessible stations", ada_per_breaks
        )
      ),
    ada_per_breaks = factor(ada_per_breaks)
    )
Simple feature collection with 176 features and 14 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: 971013.5 ymin: 136686.8 xmax: 1067383 ymax: 272844.3
epsg (SRID):    NA
proj4string:    +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
First 10 features:
   borocode  boroname countyfips ntacode                   ntaname
1         3  brooklyn        047    BK88              Borough Park
2         4    queens        081    QN51               Murray Hill
3         4    queens        081    QN27             East Elmhurst
4         4    queens        081    QN07                    Hollis
5         1 manhattan        061    MN06            Manhattanville
6         4    queens        081    QN02 Springfield Gardens North
7         4    queens        081    QN46     Bayside-Bayside Hills
8         3  brooklyn        047    BK25                 Homecrest
9         2     bronx        005    BX59     Westchester-Unionport
10        4    queens        081    QN41      Fresh Meadows-Utopia
   shape_leng shape_area tot_stat ada_stat not_ada ada_percent boro_order
1    39247.23   54005019        4        0       4          NA   Brooklyn
2    33266.90   52488277       NA       NA      NA          NA       <NA>
3    19816.71   19726846       NA       NA      NA          NA       <NA>
4    20976.34   22887773       NA       NA      NA          NA       <NA>
5    17040.69   10647077        3        0       3          NA  Manhattan
6    25433.58   28428438       NA       NA      NA          NA       <NA>
7    43014.81   80797536       NA       NA      NA          NA       <NA>
8    27514.02   29991967        1        0       1          NA   Brooklyn
9    26557.41   23945210        2        0       2          NA      Bronx
10   22106.43   27774853       NA       NA      NA          NA       <NA>
              color_group         ada_per_breaks
1  No accessible stations No accessible stations
2             No stations                   <NA>
3             No stations                   <NA>
4             No stations                   <NA>
5  No accessible stations No accessible stations
6             No stations                   <NA>
7             No stations                   <NA>
8  No accessible stations No accessible stations
9  No accessible stations No accessible stations
10            No stations                   <NA>
                         geometry
1  MULTIPOLYGON (((990897.9 16...
2  MULTIPOLYGON (((1038593 221...
3  MULTIPOLYGON (((1022728 217...
4  MULTIPOLYGON (((1051540 201...
5  MULTIPOLYGON (((999174.3 23...
6  MULTIPOLYGON (((1050735 185...
7  MULTIPOLYGON (((1048560 223...
8  MULTIPOLYGON (((995746.9 16...
9  MULTIPOLYGON (((1028696 245...
10 MULTIPOLYGON (((1045896 205...
nyc.neigh.count.prep %>% 
  mutate(
    ada_per_breaks = ifelse(
      ntacode %in% nyc.parks.air.sf$ntacode, "Park / Airport",
      ifelse(
        ntacode %in% nyc.neigh.notada$ntacode, "No accessible stations", ada_per_breaks
        )
      ),
    ada_per_breaks = replace_na(ada_per_breaks, "No stations")
    ) %>% 
   select(boroname, ntaname, ntacode, ada_per_breaks, tot_stat:ada_percent) %>% 
  mapview(zcol = "ada_per_breaks", col.regions = c("#c7e9b4", "#7fcdbb", "#41b6c4", "#2c7fb8", "#253494", "#E69F00", "gray50", "grey20"))

5.4 Census tract buffer analysis

Now to pivot to the census tracts. The tracts themselves are too small in area to learn something useful by simply counting the number of stations within the area of each tract. Instead, I will use buffer analysis to determine how many stations are within a certain range from each census tract center.

First, the center of each tract can be found using the st_centroid function from the sf package:

tract.cent <- st_centroid(nyc.census.4boro)
# result mapped: 
nyc.census.4boro %>% 
  ggplot() +
  geom_sf(color = "#BFBFBF") +
  geom_sf(data = tract.cent, size = 1, color = "#4E79A7")

Then, those centroid points can be used as the center to draw a circle with a chosen radius, specified in feet with the dist argument in the st_buffer function call below. I chose to draw buffers with a half-mile radius, or 2,640 ft, which may even be too generous of a radius for someone getting around in a wheelchair.

The resulting geometry is too chaotic to map all at once, but I can select two census tracts to demonstrate as an example.

# half-mile buffer
tract.cent.halfmi.buff <- st_buffer(tract.cent, dist = 2640)
#example
nyc.census.4boro %>% 
  ggplot() +
  geom_sf(color = "#BFBFBF") +
  geom_sf(
    data = tract.cent.halfmi.buff %>%
      filter(boroct2010 == 1009800 | boroct2010 == 1018400 ),
    alpha = 0.8, color = "#F28E2B", fill = "#F28E2B"
    ) +
  geom_sf(data = sub.sf.nyc.crs, color = "black", size = 1)

Above are two examples of buffers drawn represented as orange circles, with the subway station stops overlaid in black. These new spatial objects inherited the geometry from the NYC tract maps, and have the same properties as other sf objects. From here, it’s only a matter of using a spatial join to determine how many subway station stops fall into each buffer.

Below is an example of the join for the more northern buffer, the one that’s at the top of Central Park. This buffer captures 5 separate subway stations: 2 for the red line and 3 for the green line.

st_join(sub.sf.nyc.crs, tract.cent.halfmi.buff) %>% 
  st_set_geometry(NULL) %>% 
  filter(boroct2010 == 1018400) %>% 
  select(stat_name:route_name, boroct2010) %>% 
  arrange(stat_name, route_name)
                              stat_name   ada route_name boroct2010
1 irt_lenox_110th st-central park north FALSE          2    1018400
2 irt_lenox_110th st-central park north FALSE          3    1018400
3                    irt_lenox_116th st FALSE          2    1018400
4                    irt_lenox_116th st FALSE          3    1018400
5                irt_lexington_110th st FALSE          6    1018400
6                irt_lexington_116th st FALSE          6    1018400
7                irt_lexington_125th st  TRUE          4    1018400
8                irt_lexington_125th st  TRUE          5    1018400
9                irt_lexington_125th st  TRUE          6    1018400

Because multiple trains routes stop at most of these station, by chosen my method, I would then count that the census tract associated with that buffer is in range of 9 subway station route stops.

For the full join, I will break it up by borough, because all boroughs, except for Brooklyn and Queens, are completely separated by water.

sub.sf.nyc.crs.boro <- sub.sf.nyc.crs %>% 
  inner_join(
    sub.boro %>% 
      st_set_geometry(NULL) %>% 
      as_tibble() %>% 
      select(stat_name, boroname) %>% 
      distinct(),
    by = "stat_name"
  )

sub.join.borolim <- sub.sf.nyc.crs.boro %>% 
  # manhattan
  filter(boroname == "manhattan") %>% 
  st_join(
    tract.cent.halfmi.buff %>% 
      filter(boroname == "manhattan")
    ) %>% 
  st_set_geometry(NULL) %>% 
  bind_rows(
    # bronx
    sub.sf.nyc.crs.boro %>% 
      filter(boroname == "bronx") %>% 
      st_join(
        tract.cent.halfmi.buff %>% 
          filter(boroname == "bronx")
        ) %>% 
      st_set_geometry(NULL)
    ) %>% 
  bind_rows(
    # brooklyn and queens together
    sub.sf.nyc.crs.boro %>% 
      filter(boroname == "brooklyn" | boroname == "queens") %>% 
      st_join(
        tract.cent.halfmi.buff %>% 
          filter(boroname == "brooklyn" | boroname == "queens")
        ) %>% 
      st_set_geometry(NULL)
    )

nyc.4boro.buff.join <- nyc.census.4boro %>% 
  full_join(
    sub.join.borolim %>% 
      group_by(boroct2010) %>% 
      summarize(
        tot_stat = n(),
        ada_stat = sum(ada),
        not_ada = tot_stat - ada_stat,
        ada_percent = ada_stat * 100 / tot_stat
      ) %>% 
      ungroup(),
    by = "boroct2010"
    )


nyc.4boro.buff.join %>% 
  ggplot() +
  geom_sf(aes(fill = tot_stat)) +
  scale_fill_viridis_c(option = "A") +
  geom_sf(data = sub.sf.nyc.crs, color = "white", size = 1)

nyc.4boro.buff.join %>% 
  ggplot() +
  geom_sf(aes(fill = tot_stat), color = "purple4") +
  scale_fill_viridis_c(option = "D")

mapview(nyc.4boro.buff.join, zcol = "tot_stat")
nyc.4boro.buff.join %>% 
    mutate(
    color_group = ifelse(
      str_detect(ntaname, "park"), "Park", ifelse(
        str_detect(ntaname, "Airport"), "Airport", ifelse(
          !(str_detect(ntaname, "park") | str_detect(ntaname, "Airport")) & is.na(tot_stat), "No stations", ifelse(
            !(str_detect(ntaname, "park") | str_detect(ntaname, "Airport")) & tot_stat > 0 & ada_stat < 1, "No accessible stations", ifelse(
              !(str_detect(ntaname, "park") | str_detect(ntaname, "Airport")) & tot_stat > 0 & ada_stat > 0, "At least one accessible station", "Don't know"
              )
            )
          )
        )
      )
    ) %>% 
  ggplot() +
  geom_sf(aes(fill = color_group)) +
  scale_fill_manual(values = c("grey30", "#0072B2", "#E69F00", "#999999", "#bae4b3")) +
  geom_sf(data = nyc.neigh.4boro, color = "white", size = 0.5, fill = NA) +
  geom_sf(data = sub.sf.nyc.crs, color = "#BFBFBF", fill = "black", size = 1, pch = 21)

  #geom_sf(data = sub.sf.nyc.crs, color = "#BFBFBF", size = 1)

nyc.4boro.buff.join %>% 
  mutate(ada_stat = ifelse(ada_stat == 0, NA, ada_stat)) %>% 
  ggplot() +
  geom_sf(aes(fill = ada_stat)) +
  geom_sf(data = nyc.neigh.4boro, color = "blue", fill = NA) +
  scale_fill_viridis_c(option = "A") +
  geom_sf(data = sub.sf.nyc.crs, color = "white", size = 1)

6 Summary and Conclusion

Conclusions

(Compare the inspiration map to the created map, and then to the census tract map)

7 Session Info

sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2  mapview_2.6.3   ggthemes_4.0.1  sf_0.7-2       
 [5] forcats_0.3.0   stringr_1.3.1   dplyr_0.7.8     purrr_0.2.5    
 [9] readr_1.3.1     tidyr_0.8.2     tibble_2.0.1    ggplot2_3.1.0  
[13] tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0        lubridate_1.7.4   lattice_0.20-38  
 [4] png_0.1-7         class_7.3-15      utf8_1.1.4       
 [7] assertthat_0.2.0  digest_0.6.18     mime_0.6         
[10] R6_2.3.0          cellranger_1.1.0  plyr_1.8.4       
[13] backports_1.1.3   stats4_3.5.2      evaluate_0.12    
[16] e1071_1.7-0       httr_1.4.0        pillar_1.3.1     
[19] rlang_0.3.1       lazyeval_0.2.1    readxl_1.2.0     
[22] rstudioapi_0.9.0  raster_2.8-4      rmarkdown_1.11   
[25] labeling_0.3      webshot_0.5.1     htmlwidgets_1.3  
[28] munsell_0.5.0     shiny_1.2.0       broom_0.5.1      
[31] compiler_3.5.2    httpuv_1.4.5.1    modelr_0.1.2     
[34] xfun_0.4          pkgconfig_2.0.2   base64enc_0.1-3  
[37] htmltools_0.3.6   tidyselect_0.2.5  codetools_0.2-16 
[40] fansi_0.4.0       viridisLite_0.3.0 crayon_1.3.4     
[43] withr_2.1.2       later_0.7.5       grid_3.5.2       
[46] satellite_1.0.1   nlme_3.1-137      jsonlite_1.6     
[49] xtable_1.8-3      gtable_0.2.0      DBI_1.0.0        
[52] magrittr_1.5      units_0.6-2       scales_1.0.0     
[55] cli_1.0.1         stringi_1.2.4     promises_1.0.1   
[58] sp_1.3-1          leaflet_2.0.2     xml2_1.2.0       
[61] generics_0.0.2    tools_3.5.2       glue_1.3.0       
[64] hms_0.4.2         crosstalk_1.0.0   yaml_2.2.0       
[67] colorspace_1.4-0  classInt_0.3-1    rvest_0.3.2      
[70] knitr_1.21        bindr_0.1.1       haven_2.0.0